Abstract


Increasingly, real-world problems require multiple predictions, while traditional supervised
learning techniques focus on making a single best prediction. For instance, in advertisement
placement on the web, a list of advertisements is placed on a page with the objective of
maximizing click-through rate on that list.

In this work, we build an efficient framework for making sets or lists of predictions where
the objective is to optimize any utility function which is (monotone) submodular over a list of
predictions. Other examples of tasks where multiple predictions are important include grasp
selection in robotic manipulation, where the robot arm must evaluate a list of grasps with the
aim of finding a successful grasp as early in the list as possible, and trajectory selection
for mobile ground robots, where, given computational time limits, the task is to select a
list of trajectories from a much larger set of feasible trajectories so as to minimize the expected
cost of traversal. In computer vision tasks like frame-to-frame target tracking in video, multiple
hypotheses about the target location and pose must be considered by the tracking algorithm.

For  each  of  these  cases,  we  optimize  for  the  content  and  order  of  the  list  of  predictions. 

Crucially, and in contrast with existing work on list prediction, our approach to predicting
lists is based on very simple reductions of the problem of predicting lists to a series
of simple classification/regression tasks.

This  provides  powerful  flexibility  to  use  any  existing  prediction  method  while  ensuring 
rigorous  guarantees  on  prediction  performance.  We  analyze  these  meta-algorithms  for  list 
prediction  in  both  the  online,  no-regret  and  generalization  settings. 

Furthermore, we extend the methods to make multiple predictions in structured output
domains where even a single prediction is a combinatorial object, e.g., challenging vision
tasks like semantic scene labeling and monocular pose estimation.

We conclude with case studies that demonstrate the power and flexibility of these reductions
in problems ranging from document summarization and prediction of the pose of humans in
images to predicting the best set of robotic grasps and purely vision-based autonomous
flight in densely cluttered environments.
CHAPTER 1


Introduction 


Many real-world problems require or benefit from multiple predictions. Consider advertisement
placement on the web: a natural formulation involves the prediction of a list of
advertisements to show, rather than a single choice, on a web page with the objective of
maximizing the click-through rate [Bishop et al. 2006] of that list.

Other examples of domains where multiple predictions are important include: web search,
where a list of results is presented in response to a query [Carbonell and Goldstein 1998;
Manning et al. 2008; Horvitz 2001]; extractive document summarization, where multiple
representative sentences are selected from a document to summarize it [Ross et al. 2013b]; grasp
selection in robotic manipulation, where the robot arm must evaluate a list of grasps with
the aim of finding a successful grasp as early in the list as possible; trajectory selection
for mobile ground robots, where the task is to select a list of trajectories from a much larger
set of feasible trajectories so as to minimize the expected cost of traversal [Dey et al. 2012]; and
frame-to-frame target tracking in video, where multiple hypotheses about the target location
and pose must be considered by the tracking algorithm [Pirsiavash et al. 2011; Park
and Ramanan 2011]. This problem of returning a list of predictions to a query is remarkably
common across tasks in fields like computer vision, path planning, manipulation, information
retrieval, optimal control, computational biology, and human-robot interaction.

Traditionally,  machine  learning  has  developed  mature  tools  with  well  understood  theory 
and  practice  for  producing  a  single  best  prediction.  In  this  work  we  build  simple,  provably 
efficient  tools  for  making  multiple  predictions  whenever  the  objective  is  to  optimize  any 
objective  function  which  is  (monotone)  submodular  over  a  list  of  predictions.  In  spite  of  the 
ubiquity  of  such  list  prediction  problems,  tools  for  them  have  lagged  behind  the  state-of-the- 
art  for  single  best  prediction. 

Drawing upon recent advances in the optimization literature [Streeter and Golovin 2008;
Radlinski et al. 2008; Munagala et al. 2005; Horvitz 2001; Golovin and Krause 2010; Krause
and Guestrin 2008], we develop a set of methods for learning to make multiple predictions
with theoretical performance guarantees in a number of broad settings. For each setting we
answer the questions 1) what is the right subset of predictions to evaluate given budgeted
computation time, and/or 2) what is the right order of evaluating these predictions when
mission success is critical? The settings can be broadly grouped into the following categories:

• Library of choices: Settings where the task is to predict a list from an a priori set
of a large number of choices, and the only information available is obtained by executing that list,
e.g., path planning for autonomous mobile robots, where the task is to select a subset
of trajectories to evaluate on the terrain around the robot from a large set of feasible
trajectories. The true quality of a trajectory is revealed only after it has been traversed
by the robot.

• Contextual library of choices: A library of choices where, for every available choice,
additional information is available via features of the environment, e.g., advertisement
placement, where for a given context on a web page a list of advertisements needs to be
selected for placement in the side pane. Another example is making multiple predictions
in structured output problems like monocular pose estimation, where the task is to
annotate the pose of an object from a single image. In this case the context is the
features of the image, and multiple predictions are necessary per image frame for a
subsequent tracking algorithm to find the right pose.

Additionally, the above settings can each be further subdivided into two additional categories:

• Static: In this setting the order in which queries are received by the system is independent
of the list of choices predicted for each query. Monocular pose estimation is an example:
the list of predictions for a given query image does not
influence in any manner the choice of the next query image.

•  Dynamic:  In  contrast  to  the  static  setting,  in  the  dynamic  setting  the  choice  of  list 
of  actions  predicted  directly  influences  the  next  query  the  system  will  receive.  A  good 
example  is  path  planning  for  autonomous  mobile  robots  where  predicting  which  sets 
of  trajectories  to  evaluate  in  the  current  step  decides  the  next  place  in  the  world  the 
robot  will  travel  to  in  the  next  time  step.  Such  dynamic  settings  also  arise  often  in 
interactive  systems.  Other  concrete  examples  include  web  search,  manipulation,  and 
conversational  robots. 

There are two main properties of a list that one cares about: 1) For a list of budgeted
length, is the right answer contained within the list (i.e., does the list have high recall)? 2) What is
the order of evaluating a library of items so that the correct choice is encountered as early
as possible? In both cases we directly optimize for the objective of interest during training,
without incorporating any additional parameters.

We  will  first  discuss  algorithms  (Chapter  2)  and  provide  a  concrete  example  (Chapter 
2.1)  and  then  consider  analysis  (Chapter  2.2).  In  Chapter  3  we  will  detail  more  efficient 
versions,  their  detailed  analysis  and  more  case  studies.  Finally  we  will  be  considering  the 
relationship  with  other  approaches  in  literature  for  providing  multiple  predictions  (Chapter 
5). 

Chapter 2 efficiently extends the greedy algorithm to use features of the environment
by having it build lists of policies instead of lists of items. We term this procedure
ConSeqOpt, short for “CONtextual SEQuence OPTimization”. By competing against the
best list of policies in a given policy class we produce better lists. Applications to robot
motion planning for manipulation are presented.

Chapter 3.1 proposes a more efficient version of ConSeqOpt, which we term SCP,
short for “Sequential Contextual Prediction”. Chapter 3.2 demonstrates additional case studies
for both SCP and ConSeqOpt.

Chapter 4 extends the approach of ConSeqOpt to structured output problems like semantic
scene classification and monocular human pose estimation. The exponentially large
label set requires care in implementing or approximating the weighting scheme of Chapter 2.
We term the resulting family of algorithms SeqNBest, a contraction of “SEQuence N-Best”.

In Chapter 5, we review, compare and contrast alternate approaches to making multiple
predictions. These include methods based on extracting multiple predictions from a single
statistical model [Batra et al. 2012], heuristic objective functions for constructing diverse
predictions [Carbonell and Goldstein 1998], and ad-hoc schemes for developing multiple models
[Guzman-Rivera et al. 2014b].

In  Chapter  6  we  apply  the  paradigm  of  making  multiple  predictions  to  enabling  pure 
vision-based  autonomous  unmanned  aerial  vehicle  (UAV)  flight  through  dense  clutter.  We 
demonstrate  real  world  experiments  with  an  off-the-shelf  quadrotor  and  show  that  by  making 
multiple  obstacle  depth  predictions,  up  to  71%  increase  in  average  flight  distance  can  be 
achieved  over  making  a  single  best  prediction. 

We  will  conclude  with  still  open  problems  and  future  directions  in  Chapter  7. 
CHAPTER 2


Contextual  Optimization  of  Lists 


We begin by delving straight into the algorithm for predicting a list of items given some
context of the example. By the word “context” we mean features of the example under
consideration. A list is an ordered sequence of items of finite length. In Chapter 2.1 we will
give a concrete case using this algorithm for robot motion planning. For now assume that we
have a library of items $\mathcal{A}$ and a dataset $\mathcal{D}$ of $|\mathcal{D}|$ examples. We will use $i$ to index position in
the list and $j$ to index examples. We use $\{\ldots\}$ to indicate a list. A list of $N$ items is indicated
by $\{a_1, a_2, \ldots, a_N\}$. Each example is represented by a feature vector $x_j$ of length $L$. By stacking the $x_j$
we create a matrix $X$ of size $|\mathcal{D}| \times L$. Thus each row of $X$ represents a feature vector. $\mathcal{S}$ is
the space of all possible lists that can be made from items in library $\mathcal{A}$. Given such a feature
matrix $X$ and library $\mathcal{A}$, our aim is to predict a list of items $S \in \mathcal{S}$ of length $N < |\mathcal{D}|$
for each example, so that we can maximize $F = \mathbb{E}_{d \sim \mathcal{D}}[f(S_j, x_j)]$, where $f : \mathcal{S} \times d \to [0, 1]$
is some task-specific utility function we are interested in maximizing. Note that $f$ is a list
function, i.e., it takes in lists of items $S_j$ as input, evaluates them on an example $d$
represented by feature vector $x_j$, and returns a non-negative score. The example $d$ is drawn
from an (unknown) distribution ($d \sim \mathcal{D}$) of examples. Later on we will show how special
mathematical properties of $f$ lead to performance guarantees of the algorithms we present in
this work. We term the first algorithm we present below ConSeqOpt Batch, where ConSeqOpt is
short for “CONtextual SEQuence OPTimization”. Due to the batch nature of training, where
it sees all the training data before it predicts a list, we append the word “Batch”.

The input to ConSeqOpt Batch (Algorithm 1) is the desired list length $N$, a multi-class
classifier training procedure, the dataset $\mathcal{D}$ of examples and the library of items $\mathcal{A}$. The
algorithm proceeds by training a classifier $\pi_i \in \mathcal{H}$ in the loop indexed by $i$, where $\mathcal{H}$ is the
hypothesis space. Line 2 evaluates the function computeMarginalLoss and outputs a matrix
$M_{L_i}$. $M_{L_i}$ and the feature matrix $X$ are then used to train a multi-class cost-sensitive
classifier $\pi_i$ in line 3.

We  use  a  toy  example  to  detail  every  step  of  the  training  process.  Suppose  we  have  a 
Algorithm 1 ConSeqOpt Batch: Algorithm for training using classifiers
Input: List length $N$,
    Multi-class cost-sensitive classifier training procedure: $\mathbb{R}^{|\mathcal{D}| \times L} \times \mathbb{R}^{|\mathcal{D}| \times |\mathcal{A}|} \to \mathcal{H}$,
    Dataset $\mathcal{D}$ of $|\mathcal{D}|$ examples and associated features $X : \mathbb{R}^{|\mathcal{D}| \times L}$,
    Library of items $\mathcal{A}$
Output: list of classifiers $\{\pi_1, \pi_2, \ldots, \pi_N\}$ ($\pi : \mathbb{R}^{|\mathcal{D}| \times L} \to \mathcal{A}^{|\mathcal{D}|}$)
1: for $i = 1$ to $N$ do
2:     $M_{L_i} \leftarrow$ computeMarginalLoss($X$, $\{\pi_1, \pi_2, \ldots, \pi_{i-1}\}$, $\mathcal{A}$)
3:     $\pi_i \leftarrow$ train($X$, $M_{L_i}$)
4: end for


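Algorithm 2 computeMarginalLoss($X$, $\{\pi_1, \pi_2, \ldots, \pi_{i-1}\}$, $\mathcal{A}$)
Input: Features of the dataset $X : \mathbb{R}^{|\mathcal{D}| \times L}$,
    List of multi-class cost-sensitive classifiers $\{\pi_1, \pi_2, \ldots, \pi_{i-1}\}$ ($\pi : \mathbb{R}^{|\mathcal{D}| \times L} \to \mathcal{A}^{|\mathcal{D}|}$),
    Library of items $\mathcal{A}$
Output: $M_{L_i}$
1: for $j = 1$ to $|\mathcal{D}|$ do
2:     for $k = 1$ to $|\mathcal{A}|$ do
3:         $M_{B_i}[j, k] = f(\pi_1(x_j) \oplus \pi_2(x_j) \ldots \pi_{i-1}(x_j) \oplus \mathcal{A}[k], x_j) - f(\pi_1(x_j) \oplus \pi_2(x_j) \ldots \pi_{i-1}(x_j), x_j)$
4:     end for
5: end for
6: $M_{L_i}$ = convertGainsToLosses($M_{B_i}$)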


Algorithm 3 ConSeqOpt Batch: Algorithm for inference using classifiers
Input: List of multi-class cost-sensitive classifiers $\{\pi_1, \pi_2, \ldots, \pi_N\}$,
    Test example feature vector $x : \mathbb{R}^{1 \times L}$
Output: List of selected items $S = \{a_1, a_2, \ldots, a_N\}$
1: Initialize empty list $S = \{\}$
2: for $i = 1$ to $N$ do
3:     $S \leftarrow S \oplus \pi_i(x)$
4: end for
Figure 2.1: Illustration of two iterations of ConSeqOpt Batch using multi-class classifiers
on a toy dataset of 4 examples and library of 10 items.

dataset of $|\mathcal{D}| = 4$ examples, where each example is described by a feature vector of dimension
$L = 8$. So $X$ is of shape $4 \times 8$. Also suppose that we have a library of $|\mathcal{A}| = 10$ items.
Figure 2.1 shows the toy $X$ matrix, the library $\mathcal{A}$ and the values of important variables for
two iterations of the loop in Algorithm 1. We will now walk through the steps to illustrate
how these numbers were obtained:


Iteration 1:

• computeMarginalLoss: In the first iteration, line 2 executes computeMarginalLoss.
It takes in $X$, $\mathcal{A}$ and the output of previously trained classifiers on $X$ as input. Since at
this point no previous classifiers have been trained, there are no previous predictions.
We first create the entries of the matrix $M_{B_1}$ by evaluating our objective function $f$
on every example, for every item in the library $\mathcal{A}$. The columns correspond to items.
For ease of illustration only the first row has been filled out. Say that when item $a_1$ is
input to $f$ on the first example, the output is $f(a_1, x_1) = 0.8$. Similarly $f(a_2, x_1) = 0.1$,
$f(a_3, x_1) = 0.2$, etc., until all columns are filled. This is repeated for every example until
the whole matrix is filled. The entries in each row of $M_{B_1}$ are then subtracted
from the maximum entry in that row and rescaled to $[0, 1]$ to create $M_{L_1}$.

• train: In this step the train function is executed. This takes in the $M_{L_1}$ matrix computed
in the last step and the feature matrix $X$ of the examples as input and trains
a cost-sensitive multi-class classifier $\pi_1$ [Langford and Beygelzimer 2005]. Each item in
$\mathcal{A}$ constitutes a class and the matrix $M_{L_1}$ constitutes the per-example, per-class cost of
predicting an item for that example. For example, for the first example the classifier
will see that the best item to pick is $a_4$ since it has 0.0 cost, while picking $a_9$ is the worst
since it pays a cost of 1.0.

Iteration 2:

• computeMarginalLoss: In the second iteration, line 2 executes computeMarginalLoss
with $X$, $\mathcal{A}$ and the list of classifiers trained up to now, which in this case is just the
first classifier $\pi_1$ from iteration 1. To create the entries of the $M_{B_2}$ matrix we first
calculate the item chosen by $\pi_1$ on each example. So for every example the item that
$\pi_1$ thought to be the best is obtained. Due to imperfect learning $\pi_1$ may not pick the
best item for every example. For example, in the toy example it picks $a_5$ for the first
example even though $a_4$ is the best one since it has 0.0 cost. In Figure 2.1 we show
this step separately by using the variable $Y_{\pi_1}$, which stores this chosen item for every
example.

Secondly, the gain in $f$ obtained by adding an item from $\mathcal{A}$ is calculated. For example,
for the first position in the first row we calculate $f(a_5 \oplus a_1, x_1) - f(a_5, x_1)$, for the
second position $f(a_5 \oplus a_2, x_1) - f(a_5, x_1)$, and so on. In this example the numbers are
calculated based on the function $f$ being the max function; $f(a_5 \oplus a_1, x_1)$ is therefore
$\max(0.5, 0.8) = 0.8$. Note that $f(a_1, x_1) = 0.8$ from $M_{B_1}$. Since $f(a_5, x_1) = 0.5$, this
leads us to $f(a_5 \oplus a_1, x_1) - f(a_5, x_1) = 0.8 - 0.5 = 0.3$. The rest of the entries are filled
in this way. Once $M_{B_2}$ is calculated, the entries are subtracted from 1 to turn
gains into losses. This is the $M_{L_2}$ matrix.

• train: As in the first iteration, the train function is executed to train a multi-class
cost-sensitive classifier $\pi_2$. This uses the rows of $M_{L_2}$ to obtain per-example, per-class
costs.
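
The arithmetic of iteration 2 can be checked with a few lines of Python. This is only a sketch of the toy numbers quoted above: the names `f_max`, `marginal_gains` and `gains_to_losses` are illustrative, and only the four scores given in the text for the first example are used (the remaining entries of Figure 2.1 are not reproduced).

```python
def f_max(item_scores, chosen):
    """Toy list utility: the best score among the chosen items (0.0 for an empty list)."""
    return max((item_scores[a] for a in chosen), default=0.0)


def marginal_gains(item_scores, prefix, library):
    """Gain in f from appending each library item to the already chosen prefix."""
    base = f_max(item_scores, prefix)
    return {a: f_max(item_scores, prefix + [a]) - base for a in library}


def gains_to_losses(gains):
    """Iteration-2 rule from the text: subtract each gain from 1."""
    return {a: 1.0 - g for a, g in gains.items()}


# Scores f(a, x_1) quoted in the walkthrough for the first example.
scores_x1 = {"a1": 0.8, "a2": 0.1, "a3": 0.2, "a5": 0.5}

# pi_1 picked a5 on the first example, so the prefix is [a5].
gains = marginal_gains(scores_x1, ["a5"], ["a1", "a2", "a3"])
losses = gains_to_losses(gains)
print(gains)   # a1 gains about 0.3; a2 and a3 gain nothing over a5
print(losses)
```

Items that score below the already chosen $a_5$ contribute zero gain under the max utility, which is why their loss is the maximal value of 1.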
The procedure for computeMarginalLoss is formally described in Algorithm 2. Algorithm
3 details the simple inference procedure. On a test example represented by feature vector $x$,
the list of classifiers $\{\pi_1, \pi_2, \ldots, \pi_N\}$ is invoked to construct the list of items $S$.
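
Algorithms 1 to 3 can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the implementation used in this work: the cost-sensitive learner is replaced by a hypothetical 1-nearest-neighbour stand-in (`train_1nn`), gains are converted to losses by subtracting from 1, and the toy utility `f_any_success` treats each feature as the success probability of the corresponding item.

```python
from typing import Callable, List, Sequence

Policy = Callable[[Sequence[float]], int]  # feature vector -> index of an item


def train_1nn(X, losses) -> Policy:
    """Stand-in cost-sensitive learner (an assumption): predict the
    lowest-cost item of the nearest training example."""
    best = [min(range(len(row)), key=row.__getitem__) for row in losses]

    def policy(x):
        nearest = min(range(len(X)),
                      key=lambda j: sum((p - q) ** 2 for p, q in zip(X[j], x)))
        return best[nearest]

    return policy


def compute_marginal_loss(X, policies, n_items, f):
    """Algorithm 2: per-example, per-item loss given the earlier policies."""
    losses = []
    for x in X:
        prefix = [pi(x) for pi in policies]
        base = f(prefix, x)
        gains = [f(prefix + [k], x) - base for k in range(n_items)]
        losses.append([1.0 - g for g in gains])  # gains -> losses, f in [0, 1]
    return losses


def conseqopt_train(X, n_items, f, list_len, train=train_1nn) -> List[Policy]:
    """Algorithm 1: train one policy per list position."""
    policies: List[Policy] = []
    for _ in range(list_len):
        losses = compute_marginal_loss(X, policies, n_items, f)
        policies.append(train(X, losses))
    return policies


def conseqopt_predict(policies, x) -> List[int]:
    """Algorithm 3: each policy fills one slot of the list."""
    return [pi(x) for pi in policies]


def f_any_success(items, x):
    """Toy utility (assumption): probability that at least one distinct chosen
    item succeeds, where x[k] is the success probability of item k."""
    fail = 1.0
    for k in set(items):
        fail *= 1.0 - x[k]
    return 1.0 - fail


X_train = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.3]]
policies = conseqopt_train(X_train, n_items=3, f=f_any_success, list_len=2)
print(conseqopt_predict(policies, [0.85, 0.1, 0.5]))
```

In this toy run the second policy learns to avoid repeating the first policy's choice, because a repeated item has zero marginal gain under the assumed utility.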

Algorithm 4 shows the equivalent algorithm using regressors instead of classifiers
(Algorithm 1). This alternate formulation has the advantage of being able to introduce
new items to the library $\mathcal{A}$ without retraining the sequence of regressors. Instead of directly
identifying a target class, we use a regressor in each position of the list to produce an estimate
of the gain from each item at that particular position. $M_{B_i}$, calculated in line 2, is a $|\mathcal{D}| \times |\mathcal{A}|$
matrix of the actual marginal benefit, computed in a similar fashion to $M_{L_i}$ of Algorithm 1,
and $\hat{M}_{B_i}$ is the estimate given by the regressor at the $i$th position. In line 2 we compute
the feature matrix $X_i$ and $M_{B_i}$ in the function computeFeaturesAndBenefit. This function
is detailed in Algorithm 5. In this case, a feature vector is computed per item per example
in the function expressItemInExample. The implementation of this function is application
dependent and a concrete example will be provided in Chapter 2.1. By stacking up all such
feature vectors $x_j$, where $j$ indexes into examples, we build up the matrix $X_i$. For feature
vectors of length $L$, $X_i$ has dimensions $|\mathcal{D}||\mathcal{A}| \times L$. The features and gains at the $i$th slot
are used to train regressor $\mathcal{R}_i$, which is then invoked on the same training data, producing the
estimate $\hat{M}_{B_i}$. Note that each data point consists of a row of the matrix $X_i$, and the target
value is the corresponding entry for that item and example in $M_{B_i}$. For each example, we
pick the item $a$ which produces the maximum entry in the corresponding row of $\hat{M}_{B_i}$ to be
our chosen item for that example (line 5 in Algorithm 4). These items are then stacked up
to produce $Y_{\mathcal{R}_i}$, which is then a $|\mathcal{D}|$-length vector as illustrated in Figure 2.1.

For  inference,  we  take  the  list  of  trained  regressors  obtained  from  training  and  a  test 
example  d,  express  all  items  in  A  in  this  new  example  to  obtain  matrix  X,  where  each  row 
is  the  feature  vector  describing  an  item  from  A  in  that  example.  We  invoke  the  regressor  for 
that  position  of  the  list,  on  each  feature  vector  in  X  to  get  the  predicted  gains  of  each  item 
at  that  particular  position.  We  choose  the  item  with  the  maximum  such  predicted  gain  and 
fill  the  list  at  that  location  with  this  item.  This  procedure  is  repeated  for  every  position  of 
the  list. 
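
The inference procedure just described (Algorithm 6) might look as follows in Python. This is a hedged sketch: the regressors are plain callables standing in for trained models, and `express_item_in_example`, the clearance feature and the item names are hypothetical placeholders for the application-dependent pieces.

```python
def infer_with_regressors(regressors, example, library, express_item_in_example):
    """Fill each list position with the item whose predicted gain is largest."""
    # One feature vector per library item, computed once for this example.
    X = [express_item_in_example(example, item) for item in library]
    chosen = []
    for regress in regressors:  # one regressor per list position
        predicted_gains = [regress(x) for x in X]
        best = max(range(len(library)), key=predicted_gains.__getitem__)
        chosen.append(library[best])
    return chosen


# Hypothetical setup: the example stores one "clearance" number per item.
def express_item_in_example(example, item):
    return [example[item]]


regressors = [
    lambda x: x[0],               # position 1: prefer high clearance
    lambda x: -abs(x[0] - 0.5),   # position 2: prefer mid-range clearance
]
example = {"left": 0.2, "straight": 0.9, "right": 0.6, "reverse": 0.95}

picked = infer_with_regressors(
    regressors, example, ["left", "straight", "right"], express_item_in_example)

# A new item can be scored by the same regressors without retraining them.
picked_extended = infer_with_regressors(
    regressors, example, ["left", "straight", "right", "reverse"],
    express_item_in_example)
print(picked, picked_extended)
```

Because each regressor scores items through their features rather than their identity, extending the library only requires computing features for the new item.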

Online Variants: Algorithms 7 and 8 detail the analogous online variants of ConSeqOpt
Batch using classifiers and regressors respectively. The online versions train $N$ copies (one
copy per position) of online multi-class cost-sensitive classification or regression learners,
such as online Support Vector Machines (SVMs) [Hazan et al. 2007; Ratliff et al. 2007b]. They process
one example at a time, sampled from $\mathcal{D}$, and update themselves with the loss incurred
on that example, using the “update” function to update the internal distribution over
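Algorithm 4 ConSeqOpt Batch: Algorithm for training using regressors
Input: List length $N$,
    Regressor training procedure: $\mathbb{R}^{|\mathcal{D}||\mathcal{A}| \times L} \times \mathbb{R}^{|\mathcal{D}||\mathcal{A}|}$,
    Dataset $\mathcal{D}$ of $|\mathcal{D}|$ examples,
    Library of items $\mathcal{A}$
Output: list of regressors $\{\mathcal{R}_1, \mathcal{R}_2, \ldots, \mathcal{R}_N\}$ ($\mathcal{R} : \mathbb{R}^{|\mathcal{D}||\mathcal{A}| \times L} \to \mathbb{R}^{|\mathcal{D}||\mathcal{A}|}$)
1: for $i = 1$ to $N$ do
2:     $X_i, M_{B_i} \leftarrow$ computeFeaturesAndBenefit($\mathcal{D}$, $\{Y_{\mathcal{R}_1}, Y_{\mathcal{R}_2}, \ldots, Y_{\mathcal{R}_{i-1}}\}$, $\mathcal{A}$)
3:     $\mathcal{R}_i \leftarrow$ train($X_i$, $M_{B_i}$)
4:     $\hat{M}_{B_i} \leftarrow \mathcal{R}_i(X_i)$
5:     $Y_{\mathcal{R}_i} \leftarrow$ argmax($\hat{M}_{B_i}$)
6: end for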


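Algorithm 5 computeFeaturesAndBenefit($\mathcal{D}$, $\{Y_{\mathcal{R}_1}, Y_{\mathcal{R}_2}, \ldots, Y_{\mathcal{R}_{i-1}}\}$, $\mathcal{A}$)
Input: Dataset $\mathcal{D}$,
    Current regressor list output on dataset $\{Y_{\mathcal{R}_1}, Y_{\mathcal{R}_2}, \ldots, Y_{\mathcal{R}_{i-1}}\}$ (each $Y_{\mathcal{R}}$ is a vector of $|\mathcal{D}|$ items),
    Library of items $\mathcal{A}$
Output: $M_{B_i} : \mathbb{R}^{|\mathcal{D}| \times |\mathcal{A}|}$, $X_i : \mathbb{R}^{|\mathcal{D}||\mathcal{A}| \times L}$
1: $X_i = []$
2: for $j = 1$ to $|\mathcal{D}|$ do
3:     for $k = 1$ to $|\mathcal{A}|$ do
4:         $M_{B_i}[j, k] = f(Y_{\mathcal{R}_1}[j] \oplus Y_{\mathcal{R}_2}[j] \ldots Y_{\mathcal{R}_{i-1}}[j] \oplus \mathcal{A}[k], x_j) - f(Y_{\mathcal{R}_1}[j] \oplus Y_{\mathcal{R}_2}[j] \ldots Y_{\mathcal{R}_{i-1}}[j], x_j)$
5:         $x_j \leftarrow$ expressItemInExample($d_j$, $\mathcal{A}[k]$)
6:         $X_i \leftarrow X_i \oplus x_j$
7:     end for
8: end for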


Algorithm 6 ConSeqOpt Batch: Algorithm for inference using regressors
Input: Trained list of regressors $\{\mathcal{R}_1, \mathcal{R}_2, \ldots, \mathcal{R}_N\}$,
    Test example $d$
Output: List of selected items $S = \{a_1, a_2, \ldots, a_N\}$
1: $S = \{\}$
2: $X = []$
3: for $k = 1$ to $|\mathcal{A}|$ do
4:     $x_k \leftarrow$ expressItemInExample($d$, $\mathcal{A}[k]$)
5:     $X \leftarrow X \oplus x_k$
6: end for
7: for $i = 1$ to $N$ do
8:     $a \leftarrow$ argmax $\mathcal{R}_i(X)$
9:     $S \leftarrow S \oplus a$
10: end for


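Algorithm 7 ConSeqOpt Online: Algorithm for training using online classifiers
Input: List length $N$,
    Multi-class cost-sensitive online classifier training procedure: $\mathbb{R}^{L} \times \mathbb{R}^{|\mathcal{A}|} \to \mathcal{H}$,
    Distribution of examples $\mathcal{D}$,
    Library of items $\mathcal{A}$
Output: list of online classifiers $\{\Phi_1, \Phi_2, \ldots, \Phi_N\}$ ($\Phi : \mathbb{R}^{L} \to \mathcal{A}$)
1: for $t = 1$ to $T$ do
2:     $d_t \leftarrow$ Sample($\mathcal{D}$)
3:     $x_t \leftarrow$ computeFeatures($d_t$)
4:     for $i = 1$ to $N$ do
5:         $M_{L_{ti}} \leftarrow$ computeMarginalLoss($x_t$, $\{Y_{\Phi_1}, \ldots, Y_{\Phi_{i-1}}\}$, $\mathcal{A}$)
6:         $\Phi_i \leftarrow$ update($x_t$, $M_{L_{ti}}$)
7:         $Y_{\Phi_i} \leftarrow \Phi_i(x_t)$
8:     end for
9: end for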


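Algorithm 8 ConSeqOpt Online: Algorithm for training using online regressors
Input: List length $N$,
    Online regression training procedure: $\mathbb{R}^{|\mathcal{A}| \times L} \times \mathbb{R}^{|\mathcal{A}|}$,
    Distribution of examples $\mathcal{D}$,
    Library of items $\mathcal{A}$
Output: list of online regressors $\{\Upsilon_1, \Upsilon_2, \ldots, \Upsilon_N\}$ ($\Upsilon : \mathbb{R}^{|\mathcal{A}| \times L} \to \mathbb{R}^{|\mathcal{A}|}$)
1: for $t = 1$ to $T$ do
2:     $d_t \leftarrow$ Sample($\mathcal{D}$)
3:     for $i = 1$ to $N$ do
4:         $X_{ti}, M_{B_{ti}} \leftarrow$ computeFeaturesAndBenefit($d_t$, $\{Y_{\Upsilon_1}, \ldots, Y_{\Upsilon_{i-1}}\}$, $\mathcal{A}$)
5:         $\Upsilon_i \leftarrow$ update($X_{ti}$, $M_{B_{ti}}$)
6:         $\hat{M}_{B_{ti}} \leftarrow \Upsilon_i(X_{ti})$
7:         $Y_{\Upsilon_i} \leftarrow$ argmax($\hat{M}_{B_{ti}}$)
8:     end for
9: end for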
learners.¹


Algorithm 7 runs for $T$ rounds. Each round it samples an example $d_t$, computes its
feature representation $x_t$, uses the list of online cost-sensitive classifiers $\Phi_i$ to compute
the marginal loss of each action at every position, and then updates each classifier using the marginal
loss matrix $M_{L_{ti}}$. Note that $M_{L_{ti}}$ now contains only one row since there is only a single
example. The updated online classifier is then used to predict an item on the example
to produce $Y_{\Phi_i}$.

Similarly, the online version using regressors in Algorithm 8 also samples an example
each round and updates a list of online regressors $\Upsilon_i$.

The inference procedures are the same as before, in Algorithms 3 and 6 for classification
and regression respectively.
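
The round structure of Algorithm 7 can be sketched in Python. The learner below is a deliberately simple stand-in and not the online SVM cited above: it keeps one linear cost model per item, updated by a gradient step on squared error, and predicts the item with the lowest estimated cost. The class `OnlineCostRegressor`, the `SCORES` table and the two one-hot contexts are all hypothetical.

```python
import random


class OnlineCostRegressor:
    """Stand-in online cost-sensitive learner (an assumption): one linear cost
    model per item, trained by stochastic gradient steps on squared error."""

    def __init__(self, n_features, n_items, lr=0.1):
        self.w = [[0.0] * n_features for _ in range(n_items)]
        self.lr = lr

    def _cost(self, k, x):
        return sum(wk * xk for wk, xk in zip(self.w[k], x))

    def predict(self, x):
        return min(range(len(self.w)), key=lambda k: self._cost(k, x))

    def update(self, x, losses):
        for k, target in enumerate(losses):
            err = self._cost(k, x) - target
            self.w[k] = [wk - self.lr * err * xk for wk, xk in zip(self.w[k], x)]


def conseqopt_online(sample, f, n_items, list_len, n_features, rounds, seed=0):
    """One learner per list position; each round updates all of them in order."""
    rng = random.Random(seed)
    learners = [OnlineCostRegressor(n_features, n_items) for _ in range(list_len)]
    for _ in range(rounds):
        x = sample(rng)                       # features of the sampled example
        prefix = []
        for learner in learners:
            base = f(prefix, x)
            # Single-row marginal loss for this example (gain subtracted from 1).
            losses = [1.0 - (f(prefix + [k], x) - base) for k in range(n_items)]
            learner.update(x, losses)
            prefix.append(learner.predict(x))  # item chosen after the update
    return learners


SCORES = [[0.9, 0.1], [0.2, 0.8]]  # hypothetical: SCORES[item][context]


def f_best(items, x):
    """Toy utility: best score among chosen items under one-hot context x."""
    return max((sum(s * c for s, c in zip(SCORES[k], x)) for k in items),
               default=0.0)


contexts = [[1.0, 0.0], [0.0, 1.0]]
learners = conseqopt_online(lambda rng: rng.choice(contexts), f_best,
                            n_items=2, list_len=2, n_features=2, rounds=200)
print(learners[0].predict(contexts[0]), learners[0].predict(contexts[1]))
```

Reducing cost-sensitive classification to per-item cost regression is only one possible choice of online learner; any no-regret learner with an `update` and `predict` step could be substituted in the same loop.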


¹ In case of a deterministic algorithm, the update function updates the internal policy for choosing an item.
2.1 Case Study: Manipulation Planning via Contextual Control Libraries

In manipulation, a common recurring task is to find a collision-free path from a start to a
goal location. This is a challenging problem, as it usually involves finding feasible paths
in the configuration (joint) space of a manipulator, which is typically high dimensional.
Probabilistically complete planners like RRT [Kuffner and LaValle 2000] and PRM [Kavraki
et al. 1996] are guaranteed to find a feasible path, with probability approaching one as planning
time grows, provided there exists at least one such path from start to goal. Unfortunately
they can be expensive to run in such high dimensional
spaces.

Recent work [Zucker et al. 2013; Jetchev and Toussaint 2009] has shown an alternate
approach, which proceeds by relaxing the hard constraint of avoiding obstacles into a soft
penalty term on collision and uses simple local optimization techniques that quickly lead to
smooth, collision-free trajectories suitable for robot execution. Often the default initialization
trajectory seed is a simple straight line in joint space [Zucker et al. 2013]. This
heuristic is surprisingly effective in many examples, but suffers from local convergence and
may fail to find a trajectory even when one exists. In practice, this may be tackled by
providing cleverer initialization seeds using classification [Jetchev and Toussaint 2009; Zucker
2009]. While these methods reduce the chance of falling into local minima, they do not
have any alternative plans should the chosen initialization seed fail. A contextual ranking
of a library of initialization seeds using ConSeqOpt can provide feasible alternative seeds
should earlier choices fail. We take an approach similar to Dragan et al. [2011],
where novel trajectories can be evaluated with respect to the environment the manipulator is
currently operating in. A library of proposed initialization trajectory seeds can be developed
in many ways, including human demonstration [Ratliff et al. 2007a] or use of a slow but
complete planner [Kuffner and LaValle 2000].

We conducted experiments where we attempt to plan a trajectory to a pre-grasp pose
(goal) over the target object in a cluttered example using the local optimization planner
CHOMP [Zucker et al. 2013] and minimize the total planning and execution time of the
trajectory. A training dataset of $|\mathcal{D}| = 310$ examples and a test dataset of 212 examples are
generated. Each example contains a table surface with five obstacles, and the target object is
randomly placed on the table. The starting pose of the manipulator is randomly assigned,
and the robot must find a collision-free trajectory to the end pose above the target object.
To populate the control library, we consider initialization trajectories that move first to an
“exploration point” and then to the goal. The exploration points are generated by randomly
perturbing the midpoint of the original straight-line initialization in joint space. The resulting
initial trajectories are then piecewise straight lines in joint space from the start point to the
exploration point, and from the exploration point to the goal.² The 30 trajectories generated
with the above method form our control library $\mathcal{A}$. Figure 2.2a shows an example set for a
particular example. Notice that in this case the straight-line initialization of CHOMP goes
through the obstacle, and therefore CHOMP has a difficult time finding a valid trajectory
using this initial seed.
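
The library construction described above can be sketched as follows. The perturbation distribution, its scale, the number of waypoints per segment and the 7-joint arm are assumptions made for illustration only; the text states only that the midpoint of the straight-line seed is randomly perturbed. The elbow-left and elbow-right prefixes mentioned in the footnote are omitted.

```python
import random


def straight_line(start, goal, n_points):
    """Evenly spaced waypoints on the joint-space segment from start to goal."""
    return [[s + (g - s) * t / (n_points - 1) for s, g in zip(start, goal)]
            for t in range(n_points)]


def seed_trajectory(start, goal, rng, noise=0.5, n_points=10):
    """Start -> exploration point -> goal, piecewise straight in joint space."""
    # Exploration point: the midpoint plus Gaussian noise (assumed distribution).
    explore = [(s + g) / 2.0 + rng.gauss(0.0, noise) for s, g in zip(start, goal)]
    first = straight_line(start, explore, n_points)
    second = straight_line(explore, goal, n_points)
    return first + second[1:]  # drop the duplicated exploration point


def build_library(start, goal, size=30, seed=0):
    """A control library of `size` perturbed seed trajectories."""
    rng = random.Random(seed)
    return [seed_trajectory(start, goal, rng) for _ in range(size)]


start = [0.0] * 7                               # hypothetical 7-joint start pose
goal = [1.0, -0.5, 0.3, 1.2, 0.0, 0.8, -0.2]    # hypothetical goal pose
library = build_library(start, goal)
print(len(library), len(library[0]))
```

Each trajectory in the resulting library would then be one candidate item for ConSeqOpt to rank in context.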


(a) The default straight-line initialization of
CHOMP is in bold. Notice this initial seed goes
straight through the obstacle and causes CHOMP
to fail to find a collision-free trajectory.


(b) The initialization seed for CHOMP found using
ConSeqOpt is in bold. Using this initial seed
CHOMP is able to find a collision-free path that
also has a relatively short execution time.


Figure  2.2:  CHOMP  initialization  trajectories  generated  as  control  actions  for  ConSeqOpt. 
The  end  effector  path  of  each  trajectory  in  the  library  is  traced  out.  The  trajectory  in  bold 
in  each  image  traces  the  initialization  seed  generated  by  the  default  straight-line  approach 
and  by  ConSeqOpt,  respectively. 


In  our  results  we  use  a  small  list  (3  positions)  to  ensure  the  overhead  of  ordering  and 
evaluating  the  library  is  small.  When  CHOMP  fails  to  find  a  collision-free  trajectory  for 
multiple initialization seeds, one can always fall back on slow but complete planners. Thus
the  contextual  control  sequence’s  role  is  to  quickly  evaluate  a  few  good  options  and  choose  the 
initialization  trajectory  that  will  result  in  the  minimum  execution  time.  We  note  that  in  our 
experiments,  the  overhead  of  ordering  and  evaluating  the  library  is  negligible  as  we  rely  on  a 
fast  predictor  and  features  computed  as  part  of  the  trajectory  optimization,  and  by  choosing 
a  small  list  length  we  can  effectively  compute  a  motion  plan  with  expected  planning  time 

2 Half  of  the  seed  trajectories  are  prepended  with  a  short  path  to  start  from  an  elbow- left  configuration, 
and  half  are  in  an  elbow-right  configuration.  This  is  because  the  local  planner  has  a  difficult  time  switching 
between  configurations,  while  environmental  context  can  provide  much  information  about  which  configuration 
to use.
under 0.5 seconds. We can solve most manipulation problems that arise in our manipulation
research  very  quickly,  falling  back  to  initializing  the  trajectory  optimization  with  a  complete 
motion  planner  only  in  the  most  difficult  of  circumstances. 

For each initialization trajectory, we calculate 17 simple feature values which populate a
row of the feature matrix $X_i$.³ During training time, we evaluate each initialization seed in
our library on all examples in the training set, and use their performance and features to train
each regressor $\mathcal{R}_i$ in ConSeqOpt Batch (Algorithm 4). At inference time, we simply run
Algorithm 6 to produce $\hat{Y}_1, \ldots, \hat{Y}_N$ as the sequence of initialization seeds to be evaluated.
Note  that  while  the  first  regressor  uses  only  the  17  basic  features,  the  subsequent  regressors 
also  include  the  difference  in  feature  values  between  the  remaining  actions  and  the  actions 
chosen  by  the  previous  regressors.  These  difference  features  improve  the  algorithm’s  ability 
to  consider  trajectory  diversity  in  the  chosen  actions. 

We  compare  ConSeqOpt  Batch  with  two  methods  of  ranking  the  initialization  library: 
a  random  ordering  of  the  actions,  and  an  ordering  by  sorting  the  output  of  the  first  regressor. 
Sorting  by  the  first  regressor  is  functionally  the  same  as  maximizing  the  absolute  benefit 
rather  than  the  marginal  benefit  at  each  slot.  We  compare  the  number  of  CHOMP  failures 
as  well  as  the  average  execution  time  of  the  final  trajectory.  For  execution  time,  we  assume 
the  robot  can  be  actuated  at  1  rad/second  for  each  joint  and  use  the  shortest  trajectory 
generated  using  the  N  seeds  ranked  by  ConSeqOpt  Batch  as  the  performance.  If  we  fail 
to  find  a  collision  free  trajectory  and  need  to  fall  back  to  a  complete  planner  (RRT  [Kuffner 
and  LaValle  2000]  plus  trajectory  optimization),  we  apply  a  maximum  execution  time  penalty 
of  40  seconds  due  to  the  longer  computation  time  and  resulting  trajectory. 

The  results  over  212  test  examples  are  summarized  in  Figure  2.3.  With  only  simple 
straight  line  initialization,  CHOMP  is  unable  to  find  a  collision  free  trajectory  in  162/212 
examples,  with  a  resulting  average  execution  time  of  33.4  seconds.  While  a  single  regressor 
(N  =  1)  can  reduce  the  number  of  CHOMP  failures  from  162  to  79  and  the  average  execution 
time  from  33.4  to  18.2  seconds,  when  we  extend  the  sequence  length,  ConSeqOpt  is  able  to 
reduce  both  metrics  faster  than  a  ranking  by  sorting  the  output  of  the  first  regressor.  This 
is  because  for  N  >  1,  ConSeqOpt  Batch  chooses  a  primitive  that  provides  the  maximum 
marginal  benefit,  which  results  in  trajectory  seeds  that  have  very  different  features  from  the 
previous  slots’  choices.  Ranking  by  the  absolute  benefit  tends  to  pick  trajectory  seeds  that 
are  similar  to  each  other,  and  thus  are  more  likely  to  fail  when  the  previous  seeds  fail.  At 

3 Length  of  trajectory  in  joint  space;  length  of  trajectory  in  task  space,  the  position  (x,y,z)  values  of  the 
end  effector  position  at  the  exploration  point  (3  values),  the  distance  field  values  used  by  CHOMP  at  the 
quarter  points  of  the  trajectory  (3  values),  joint  values  of  the  first  4  joints  at  both  the  exploration  point  (4 
values)  and  the  target  pose  (4  values),  and  whether  the  initialization  seed  is  in  the  same  left/right  kinematic 
arm configuration.
a sequence length of 3, ConSeqOpt Batch has only 16 failures and an average execution
time of 8 seconds: a 90% reduction in failures and a roughly 75% reduction in execution
time relative to the straight-line seed. Note that planning times are generally negligible
compared to execution times for manipulation, hence this improvement is significant.
Figure 2.2b shows the initialization seed
found by ConSeqOpt Batch for the same example as in Figure 2.2a. Note that this seed
avoids collision with the obstacle between the manipulator and the target object, enabling
CHOMP to produce a successful trajectory.
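
The difference between ranking by absolute benefit and ranking by marginal benefit can be seen on a toy problem. The sketch below is illustrative only: the `outcomes` table is hypothetical (it is not the experimental data above), and the helper names are our own. Seeds 0 and 1 succeed on overlapping examples, so the absolute ranking picks two redundant seeds, while the marginal ranking picks a seed that covers the remaining failure.

```python
# Hypothetical outcomes: rows are examples, columns are seeds; True = success.
outcomes = [
    [True,  True,  False],
    [True,  True,  False],
    [True,  False, False],
    [False, False, True],
]
num_seeds = 3


def failures(order):
    """Count examples on which every seed in `order` fails."""
    return sum(not any(row[j] for j in order) for row in outcomes)


def rank_by_absolute_benefit(k):
    """Sort seeds by individual success count, ignoring earlier choices."""
    success = [sum(row[j] for row in outcomes) for j in range(num_seeds)]
    return sorted(range(num_seeds), key=lambda j: -success[j])[:k]


def rank_by_marginal_benefit(k):
    """At each slot, pick the seed that leaves the fewest remaining failures."""
    order = []
    for _ in range(k):
        best = min(range(num_seeds), key=lambda j: failures(order + [j]))
        order.append(best)
    return order


absolute = rank_by_absolute_benefit(2)
marginal = rank_by_marginal_benefit(2)
print(absolute, failures(absolute))
print(marginal, failures(marginal))
```

On this toy table the absolute ranking leaves one example unsolved after two slots, whereas the marginal ranking solves all four.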


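[Figure 2.3 plots. Top panel: number of CHOMP failures; bottom panel: average execution time. Horizontal axis of both panels: Straight Seed, followed by sequence length (N) = 1, 2, 3.]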


Figure  2.3:  Results  of  ConSeqOpt  Batch  for  manipulation  planning  in  212  test  examples. 
The  top  image  shows  the  number  of  CHOMP  failures  for  three  different  methods  after  each 
slot  in  the  sequence.  ConSeqOpt  Batch  not  only  significantly  reduces  the  number  of 
CHOMP  failures  in  the  first  slot,  but  also  further  reduces  the  failure  rate  faster  than  both 
the  other  methods  when  the  sequence  length  is  increased.  The  same  trend  is  observed  in  the 
bottom  image,  which  shows  the  average  time  to  execute  the  chosen  trajectory.  The  ‘Straight 
Seed’ column refers to the straight-line heuristic used by the original CHOMP implementation.


We confirm from this experiment our initial intuition that diversity in high ranks is
important to avoid selecting redundant items which are highly likely to fail together. One
could conceivably enforce diversity by first finding an appropriate distance measure between
items and then selecting items sequentially in a greedy-like manner, where an item is only
added to the list if it is maximally distant from all items already
in the list. In practice such a procedure is difficult to implement for two reasons:
1) finding the right distance measure between high-dimensional objects like
grasps or trajectories in configuration space is non-trivial, and 2) it raises the question
of how much diversity is to be enforced, which usually leads to free parameters that need to
be cross-validated [Batra et al. 2012]. By exploiting the mathematical properties of the
task objective (monotone submodularity) we sidestep both of these issues and are also
able to provide performance guarantees for the algorithm.

2.2 ConSeqOpt Analysis


In this section we analyze ConSeqOpt Batch in detail: when such a procedure will work,
the explicit assumptions it makes, the applicability of those assumptions to real
world list prediction problems, and the performance guarantees of our methods.

2.2.1  Submodularity  of  Lists  and  the  Greedy  Algorithm 

A function $f : \mathcal{S} \times \mathcal{D} \to [0, 1]$ is monotone submodular for any list $S \in \mathcal{S}$ and example $d \sim \mathcal{D}$,
where $\mathcal{S}$ is the set of all lists, if it satisfies the following two properties:

• (Monotonicity) for any lists $S_1, S_2 \in \mathcal{S}$, $f(S_1) \le f(S_1 \oplus S_2)$ and $f(S_2) \le f(S_1 \oplus S_2)$

• (Submodularity) for any lists $S_1, S_2 \in \mathcal{S}$ and any item $a \in \mathcal{A}$, $f(S_1 \oplus S_2 \oplus a) - f(S_1 \oplus S_2) \le f(S_1 \oplus a) - f(S_1)$

where $\oplus$ denotes order-dependent concatenation of lists and $\mathcal{A}$ is the library of all available items.
In  this  work  we  are  concerned  with  contextually  maximizing  such  monotone  submodular 
objective  functions.  We  will  show  through  numerous  case  studies  (Chapter  3.2)  the  wide  real 
world  applicability  of  ConSeqOpt  and  its  more  efficient  variants. 
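
As a quick sanity check of the two properties, the following minimal sketch brute-forces them for a toy list function. Everything here is hypothetical and chosen only for illustration: the `SUCCESS` table, the item names, and the choice of `f(S, d)` as 1 when any item in the list `S` succeeds on example `d` and 0 otherwise.

```python
from itertools import product

# Hypothetical toy data: which items succeed on which example.
SUCCESS = {
    "d1": {"a"},
    "d2": {"a", "b"},
    "d3": {"c"},
}
ITEMS = ["a", "b", "c"]


def f(S, d):
    """Return 1.0 if any item in list S succeeds on example d, else 0.0."""
    return 1.0 if any(item in SUCCESS[d] for item in S) else 0.0


def all_lists(max_len):
    """Enumerate every list (as a tuple) over ITEMS up to length max_len."""
    for n in range(max_len + 1):
        yield from product(ITEMS, repeat=n)


def is_monotone_submodular(max_len=2):
    """Brute-force check of both properties for every example."""
    for d in SUCCESS:
        for S1 in all_lists(max_len):
            for S2 in all_lists(max_len):
                joined = S1 + S2  # order-dependent concatenation
                if f(S1, d) > f(joined, d) or f(S2, d) > f(joined, d):
                    return False  # monotonicity violated
                for a in ITEMS:
                    gain_late = f(joined + (a,), d) - f(joined, d)
                    gain_early = f(S1 + (a,), d) - f(S1, d)
                    if gain_late > gain_early:
                        return False  # submodularity violated
    return True


print(is_monotone_submodular())
```

The check is exhaustive only up to `max_len`, so it can refute the properties for a candidate objective but does not prove them in general.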

The first question that naturally arises is: when a list length N is specified, which
items from $\mathcal{A}$ should one put in the list to maximize the objective of interest? A
closely related question is: in what order should the items in $\mathcal{A}$ be evaluated so that
the probability of encountering an item which succeeds at the task at hand is maximized?
This is exactly the question being asked in the robot manipulation path planning problem
in Chapter 2.1. Unfortunately, both of the above problems were proven to be
NP-hard by Nemhauser et al. [Nemhauser et al. 1978]. Unless P = NP, no polynomial-time
algorithm finds the best list of specified length or the best ordering of all items in the library;
the naive exact approach is to enumerate
all possible lists of items, score each list using the monotone submodular objective and
pick the highest scoring one. Even for relatively small libraries the set of all possible
lists grows prohibitively large. For example, for a library of 30 items, the set
of all lists of length 30 contains $30! \approx 2.65 \times 10^{32}$ lists (without replacement) and $30^{30}$
(with replacement).
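
The two counts above can be reproduced directly; the short sketch below uses only the Python standard library, and the variable names are our own.

```python
import math

library_size = 30

# Orderings of all 30 items without replacement: 30!
without_replacement = math.factorial(library_size)

# Lists of length 30 with replacement: 30^30
with_replacement = library_size ** library_size

print(f"{without_replacement:.7e}")
print(f"{with_replacement:.2e}")
```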

But  it  turns  out  that  there  exist  simple  algorithms  with  good  approximation  guarantees 
for  maximizing  monotone  submodular  functions. 

Consider the Greedy algorithm (Algorithm 9), which proceeds by first initializing an
empty list S and then selecting each item such that the addition of the item to the existing list
up to that point increases $f$ the most in expectation over a distribution of examples $d \sim \mathcal{D}$.
The loop stops once N items have been picked.

Algorithm 9 Greedy
Input: List length $N$,
    Library of items $\mathcal{A}$
Output: List of items $S = a_1, a_2, \ldots, a_N$
    $S = \{\}$
    for $i = 1$ to $N$ do
        for $j = 1$ to $|\mathcal{A}|$ do
            $g_j = \mathbb{E}_{d \sim \mathcal{D}} \left[ f(S \oplus a_j, d) - f(S, d) \right]$
        end for
        $a \leftarrow \arg\max_{a_j} g_j$
        $S \leftarrow S \oplus a$
    end for
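
The Greedy procedure can be sketched in a few lines. This is a minimal illustration, not the implementation used in our experiments: the expectation over $d$ is replaced by an average over a finite sample of examples, and the toy `examples` and objective `f` are hypothetical.

```python
def greedy(items, examples, f, N):
    """Build a list of length N by repeatedly appending the item with the
    largest average marginal gain over the sampled examples."""
    S = []
    for _ in range(N):
        best_item, best_gain = None, float("-inf")
        for a in items:
            # Empirical estimate of E_d[f(S + [a], d) - f(S, d)].
            gain = sum(f(S + [a], d) - f(S, d) for d in examples) / len(examples)
            if gain > best_gain:
                best_item, best_gain = a, gain
        S.append(best_item)
    return S


# Hypothetical toy problem: each example is the set of items that succeed on it.
examples = [{"a"}, {"a", "b"}, {"a", "b"}, {"c"}]


def f(S, d):
    """1.0 if any item in the list S succeeds on example d, else 0.0."""
    return 1.0 if any(item in d for item in S) else 0.0


print(greedy(["a", "b", "c"], examples, f, N=2))
```

Here item "b" has a higher individual success rate than "c", but once "a" is in the list "b" adds nothing, so the second slot goes to "c". Ties are broken in favour of the item listed first, which is an arbitrary choice of this sketch.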

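Feige proved [Feige 1998] that the list returned by the greedy algorithm is guaranteed to
achieve at least a $(1 - 1/e)$ fraction of the value of the optimal list of length N, i.e.:

\[ \mathbb{E}_{d \sim \mathcal{D}}\, f(S_{\text{greedy}}, d) \ge \left(1 - \frac{1}{e}\right) \max_{S \in \mathcal{S},\, |S| = N} \mathbb{E}_{d \sim \mathcal{D}}\, f(S, d). \tag{2.1} \]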

This  problem  of  finding  the  optimal  list  of  budgeted  size  N  has  been  studied  in  literature  as 
the  Budgeted  Maximum  Submodular  Coverage  problem  [Khuller  et  al.  1999;  Streeter 
and  Golovin  2008]. 

Additionally  Feige  et  al.,  [Feige  et  al.  2004]  proved  that  for  the  problem  of  finding  the 
optimal  ordering  of  all  items  in  A  so  that  the  successful  item  can  be  encountered  as  early 
on  as  possible,  the  ordering  returned  by  Greedy  is  guaranteed  to  find  the  successful  item 
within  4  times  the  depth  of  the  optimal  ordering: 

\[ \text{cost}(S_{\text{greedy}}) \le 4\, \text{cost}\left( \arg\max_{|S| \le |\mathcal{A}|} f(S, d) \right), \tag{2.2} \]

where $\text{cost}(S)$ is the number of elements of the list we had to go through to find the one that
succeeded  at  the  task  at  hand.  Such  problems  have  been  studied  in  literature  as  Min-Sum 
Submodular  Set  Cover  [Feige  et  al.  2004;  Streeter  and  Golovin  2008;  Munagala  et  al. 
2005]. 
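
To make the cost in (2.2) concrete, the sketch below computes the depth at which an ordering first succeeds and averages it over examples. The toy `examples` are hypothetical, and charging `len(order) + 1` when nothing in the list succeeds is an assumption of this sketch rather than part of the definition above.

```python
def cost(order, succeeds):
    """1-based depth of the first successful item in `order`.

    Returns len(order) + 1 when no item succeeds (an assumed convention).
    """
    for depth, item in enumerate(order, start=1):
        if item in succeeds:
            return depth
    return len(order) + 1


def average_cost(order, examples):
    """Average depth to first success over a finite sample of examples."""
    return sum(cost(order, d) for d in examples) / len(examples)


# Hypothetical examples: each is the set of items that succeed on it.
examples = [{"a"}, {"a", "b"}, {"c"}, {"c"}]

print(average_cost(["a", "c", "b"], examples))
print(average_cost(["b", "a", "c"], examples))
```

The first ordering places the two complementary items early and so has the lower average cost.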

In the manipulation planning example in Chapter 2.1, the task is to find an ordering
of the initial trajectories in the library of 30 initial trajectories so that we can encounter
the trajectory which succeeds in finding a collision free path as early on as possible. This
can be equivalently phrased as maximizing the probability of encountering a successful initial
trajectory in a given list S of trajectories over a distribution of examples $d \sim \mathcal{D}$, i.e.
$\mathbb{E}_{d \sim \mathcal{D}} f(S, d) = \mathbb{E}_{d \sim \mathcal{D}} P(S, d)$. Streeter and Golovin [Streeter and Golovin 2008] proved that
such objectives are monotone submodular. Therefore given a dataset of training examples
one can use the Greedy algorithm detailed above (Algorithm 9) for finding the best ordering
of the initial trajectories in the library $\mathcal{A}$. The Greedy solution will have the performance
guarantees  outlined  above  for  both  Budgeted  Maximum  Submodular  Coverage  and 
Min-Sum  Submodular  Set  Cover. 

Intuitively,  the  Greedy  method  is  picking  items  in  the  list  which  provide  the  maximum 
marginal  benefit  at  that  position  of  the  list.  This  way  additions  to  the  list  which  do  not 
result  in  much  gain  to  the  objective  are  avoided.  Another  way  of  looking  at  this  for  the 
manipulation  problem  is  to  imagine  picking  an  initial  trajectory  in  the  first  position  of  the 
list  hoping  that  it  will  be  successful  over  most  of  the  examples  it  encounters,  but  due  to 
imperfection  in  both  data  and  evaluation  it  fails.  What  this  immediately  shows  is  that  initial 
trajectories  similar  to  the  first  one  are  likely  to  fail  as  well  and  should  not  be  selected  for 
the  second  position.  Initial  trajectories  which  are  different  from  the  first  one,  but  are  still 
likely  to  succeed  should  be  selected  (diverse  but  relevant).  The  Greedy  algorithm  naturally 
captures  this  notion  of  diversity  and  relevance  when  the  objective  is  monotone  submodular 
and  comes  up  with  an  optimal  list  up  to  approximation  guarantees. 

But  the  Greedy  algorithm  has  a  crucial  limitation:  it  ignores  the  context  of  the  example 
while  constructing  the  list  of  initial  trajectories.  Ideally  we  would  like  a  method  that  takes 
into account the rich features of the example available through the various sensors (e.g.
cameras,  depth  cameras,  lidars)  mounted  on  the  manipulator  to  aid  in  predicting  a  list 
of  initial  trajectories  to  maximize  the  probability  of  finding  a  successful  one  as  early  as 
possible.  In  the  following  section  we  analyze  how  ConSeqOpt  overcomes  this  limitation 
while  maintaining  the  performance  guarantees  of  the  Greedy  algorithm. 

2.2.2  ConSeqOpt:  Contextualizing  the  Greedy  Algorithm 

We  will  first  examine  how  the  Greedy  algorithm  is  contextualized  by  a  reduction  to  learning 
a  list  of  classifiers  to  result  in  the  ConSeqOpt  Batch  algorithm  detailed  in  Algorithms  1 
and  4. 

In  machine  learning,  reduction  [Alina  Beygelzimer  and  Zadrozny  2009]  is  the  process  of 
breaking  down  a  challenging  problem  for  which  no  easy  solution  exists  to  smaller  problems  for 
which  well-understood  theoretical  and  practical  solutions  exist.  By  relating  the  performance 
of the solutions to the smaller problems, statements can be made about the quality of the
solution to the more challenging problem.

We  reduce  the  problem  of  predicting  lists  to  invoking  a  list  of  multi-class  classifiers, 
each  of  which  in  turn  predicts  an  item  in  the  resulting  list.  We  establish  a  formal  regret 
reduction  [Beygelzimer  et  al.  2005]  between  cost  sensitive  multi-class  classification  error  and 
the  resulting  error  on  the  learned  list  of  classifiers.  Specifically,  we  demonstrate  that  if  we 
consider the items in $\mathcal{A}$ to be the classes and train a series of classifiers, one for each position
of the list, on the features of a distribution of examples, we can then produce a near-optimal
list  of  classifiers,  which  in  turn  can  be  invoked  to  produce  the  near-optimal  list  of  items  for 
a  test  example. 

Theorem 1. If each of the classifiers $\{\pi_1, \pi_2, \ldots, \pi_i, \ldots, \pi_N\}$ trained in Algorithm 1 achieves
multi-class cost-sensitive regret of $r_i$, then the resulting list of classifiers is within at least
$\left(1 - \frac{1}{e}\right) \max_{S \in \mathcal{S}} \mathbb{E}_{d \sim \mathcal{D}} f(S, d) - \sum_{i=1}^{N} r_i$ of the optimal such list of classifiers $S$ from the
same hypothesis space.⁴

Proof. (Sketch) Define the loss of a multi-class, cost-sensitive classifier $\pi$ over a distribution
of examples $\mathcal{D}$ as $l(\pi, \mathcal{D})$. Each example can be represented as $(x_n, m_n^1, m_n^2, m_n^3, \ldots, m_n^{|\mathcal{A}|})$
where $x_n$ is the set of features representing the nth example and
$m_n^1, m_n^2, m_n^3, \ldots, m_n^{|\mathcal{A}|}$ are the per class costs of misclassifying $x_n$. $m_n^1, m_n^2, m_n^3, \ldots, m_n^{|\mathcal{A}|}$ are
simply the nth row of $M_{L_i}$ (which corresponds to the nth example in the dataset $\mathcal{D}$). The best
class has a misclassification cost of 0 while the others have costs greater than or equal to 0 (there might
be multiple actions which yield equal marginal benefit). Classifiers generally minimize
the expected loss $l(\pi, \mathcal{D}) = \mathbb{E}_{(x_n, m_n^1, \ldots, m_n^{|\mathcal{A}|}) \sim \mathcal{D}} [C_\pi(x_n)]$, where $C_\pi(x_n) = m_n^{\pi(x_n)}$ denotes the
example-dependent multi-class misclassification cost. The best classifier in the hypothesis
space $\Pi$ minimizes $l(\pi, \mathcal{D})$:

\[ \pi^* = \arg\min_{\pi \in \Pi} \mathbb{E}_{(x_n, m_n^1, \ldots, m_n^{|\mathcal{A}|}) \sim \mathcal{D}} [C_\pi(x_n)]. \tag{2.3} \]

The regret of $\pi$ is defined as $r = l(\pi, \mathcal{D}) - l(\pi^*, \mathcal{D})$. Each classifier associated with the ith
slot of the list has a regret $r_i$.

Streeter et al. [Streeter and Golovin 2008] consider the case where the ith decision made
by the greedy algorithm is performed with additive error $\epsilon_i$. Denote by $\hat{S} = (\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_N)$
a variant of the list $S$ in which the ith argmax is evaluated with additive error $\epsilon_i$. This can

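4 When the objective is to minimize the time (depth in list) to find a satisficing element then the resulting
list of classifiers satisfies $\mathbb{E}_{d \sim \mathcal{D}} f(\hat{S}_{\langle N \rangle}, d) \le 4 \int_0^\infty \left(1 - \max_{S \in \mathcal{S}} \mathbb{E}_{d \sim \mathcal{D}} f(S_{\langle n \rangle}, d)\right) dn + \sum_{i=1}^{N} r_i$.
be formalized as

\[ \mathbb{E}_{d \sim \mathcal{D}} \left[ f(\hat{S}_i \oplus \hat{s}_i, d) - f(\hat{S}_i, d) \right] \ge \max_{s_i \in \mathcal{A}} \mathbb{E}_{d \sim \mathcal{D}} \left[ f(\hat{S}_i \oplus s_i, d) - f(\hat{S}_i, d) \right] - \epsilon_i, \tag{2.4} \]

where $\hat{S}_1 = ()$, $\hat{S}_i = (\hat{s}_1, \hat{s}_2, \hat{s}_3, \ldots, \hat{s}_{i-1})$ for $i > 1$ and $\hat{s}_i$ is the predicted item by classifier $\pi_i$.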

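They demonstrate that, for a budget or list length of N,

\[ \mathbb{E}_{d \sim \mathcal{D}} f(\hat{S}_{\langle N \rangle}, d) \ge \left(1 - \frac{1}{e}\right) \max_{S \in \mathcal{S}} \mathbb{E}_{d \sim \mathcal{D}} f(S, d) - \sum_{i=1}^{N} \epsilon_i, \tag{2.5} \]

assuming each item takes equal time to execute.

Thus the ith argmax in (2.4) is chosen with some error $\epsilon_i = r_i$. An $\epsilon_i$ error made
by classifier $\pi_i$ corresponds to the classifier picking an item whose gain is less than the
maximum possible. Hence the performance bound on additive error greedy list construction
stated in (2.5) can be restated as

\[ \mathbb{E}_{d \sim \mathcal{D}} f(\hat{S}_{\langle N \rangle}, d) \ge \left(1 - \frac{1}{e}\right) \max_{S \in \mathcal{S}} \mathbb{E}_{d \sim \mathcal{D}} f(S, d) - \sum_{i=1}^{N} r_i. \tag{2.6} \]

□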

Theorem 2. The list of squared-loss regressors $\{\mathcal{R}_1, \ldots, \mathcal{R}_i, \ldots, \mathcal{R}_N\}$ trained in Algorithm
4 is within at least $\left(1 - \frac{1}{e}\right) \max_{S \in \mathcal{S}} \mathbb{E}_{d \sim \mathcal{D}} f(S, d) - \sum_{i=1}^{N} \sqrt{2(|\mathcal{A}| - 1)\, r_{\text{reg}_i}}$ of the optimal list of classifiers
$S$ from the hypothesis space of multi-class cost-sensitive classifiers.


Proof. (Sketch) Langford et al. [Langford and Beygelzimer 2005] show that the regret reduction
from multi-class classification to squared-loss regression has a regret of $\sqrt{2(k - 1)\, r_{\text{reg}}}$
where $k$ is the number of classes and $r_{\text{reg}}$ is the squared-loss regret on the underlying
regression problem. In Algorithm 4 we use squared-loss regression to perform multi-class
classification, thereby incurring for each slot of the list a reduction regret of $\sqrt{2(|\mathcal{A}| - 1)\, r_{\text{reg}_i}}$,
where $|\mathcal{A}|$ is the number of items in the library and $r_{\text{reg}_i}$ is the regret of the regressor
for the ith slot. Theorem 1 states that the list of classifiers achieves $\mathbb{E}_{d \sim \mathcal{D}} f(\hat{S}_{\langle N \rangle}, d) \ge
\left(1 - \frac{1}{e}\right) \max_{S \in \mathcal{S}} \mathbb{E}_{d \sim \mathcal{D}} f(S, d) - \sum_{i=1}^{N} r_i$ of the optimal list of classifiers. Plugging in the
regret reduction from [Langford and Beygelzimer 2005] we get the result that the resulting list of
regressors in Algorithm 4 is within at least $\left(1 - \frac{1}{e}\right) \max_{S \in \mathcal{S}} \mathbb{E}_{d \sim \mathcal{D}} f(S, d) - \sum_{i=1}^{N} \sqrt{2(|\mathcal{A}| - 1)\, r_{\text{reg}_i}}$
of the optimal list of multi-class cost-sensitive classifiers. □
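
To get a feel for how the per-slot regrets enter the two bounds, the following sketch evaluates the right-hand sides of Theorems 1 and 2 for given regrets. The function names and the numbers passed in are hypothetical and serve only as an illustration of the arithmetic.

```python
import math


def classifier_bound(opt_value, regrets):
    """Lower bound of Theorem 1: (1 - 1/e) * OPT - sum_i r_i."""
    return (1.0 - 1.0 / math.e) * opt_value - sum(regrets)


def regressor_bound(opt_value, reg_regrets, library_size):
    """Lower bound of Theorem 2: each slot pays sqrt(2 (|A| - 1) r_reg_i)."""
    penalty = sum(math.sqrt(2.0 * (library_size - 1) * r) for r in reg_regrets)
    return (1.0 - 1.0 / math.e) * opt_value - penalty


# Hypothetical numbers, for illustration only.
print(classifier_bound(opt_value=0.9, regrets=[0.01, 0.02, 0.03]))
print(regressor_bound(opt_value=0.9, reg_regrets=[1e-5, 1e-5, 1e-5], library_size=30))
```

Because the regression regret sits under a square root and is scaled by the library size, the regressor bound degrades faster than the classifier bound as the library grows.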


Theorem 1 proves that ConSeqOpt Batch using classifiers efficiently finds the approximately
greedy list of classifiers from the hypothesis space of all such classifiers. Similarly
Theorem 2 proves that ConSeqOpt Batch using regressors also finds the approximately
greedy list of regressors from the hypothesis space of all such regressors.

CHAPTER 3


Data-Efficient Contextual Optimization of Lists


In  this  chapter  we  develop  a  more  efficient  version  of  ConSeqOpt.  ConSeqOpt  sequentially 
trains  N  classifiers  or  regressors,  one  for  each  position  of  the  list.  Since  predictors  responsible 
for  earlier  positions  of  the  list  will  generally  start  choosing  optimal  or  near-optimal  items  for 
most  examples  in  the  dataset,  the  predictors  responsible  for  later  positions  often  do  not 
observe  enough  examples  to  learn  effectively  to  predict  items  which  will  bring  the  maximum 
gain  to  the  objective  at  those  positions.  In  other  words  the  later  predictors  usually  starve 
for  data  unless  a  large  amount  of  data  is  available  to  begin  with.  We  propose  a  closely 
related  approach  which  trains  a  single  (no-regret)  online  learner  (policy)  to  produce  a  list 
of predictions. We term this approach the “Submodular Contextual Policy” algorithm, or in
short  SCP. 

By  leveraging  recent  work  in  imitation  learning  [Ross  et  al.  2011],  SCP  preserves  similar 
performance  guarantees  as  ConSeqOpt  while  being  more  data-efficient  since  all  the  data  is 
used  for  training  the  learner.  We  will  first  describe  the  algorithm  for  the  context-free  case, 
where no features of the example are available. In this case, as with the Greedy algorithm, the
performance of a list $S \in \mathcal{S}$ is evaluated by its expected value over an unknown distribution
of examples $d \sim \mathcal{D}$: $\mathbb{E}_{d \sim \mathcal{D}} f(S, d)$, where $f$ is monotone submodular. The Greedy algorithm
with perfect knowledge of $\mathcal{D}$ can find a list $S$ of length N such that it has the performance
guarantees listed in (2.1) and (2.2) for the Budgeted Maximum Submodular Coverage and
Min-Sum Submodular Set Cover problems respectively. Although $\mathcal{D}$ is unknown, we
assume (as in the case of ConSeqOpt) that we observe samples $d \sim \mathcal{D}$ and can evaluate
any list $S \in \mathcal{S}$ using the objective function $f(S, d)$ during training. Our goal is to develop
a  computationally  and  statistically  more  efficient  algorithm,  which  has  similar  performance 
guarantees  as  ConSeqOpt. 

Algorithm 10 describes SCP in the context-free setting for the online Budgeted Maximum
Submodular Coverage setting. SCP requires an online learning algorithm subroutine
(denoted by the function “update”) that is no-regret with respect to a bounded loss
function $l : \mathcal{A} \to [0, 1]$, maintains an internal distribution over items in $\mathcal{A}$ for prediction
and can be queried for multiple predictions (i.e. multiple samples). Algorithms that meet
these requirements include Randomized Weighted Majority [Littlestone and Warmuth 1994],
Follow-the-Leader [Kalai and Vempala 2005], EXP3 [Auer et al. 2003] and many others. In
contrast to prior work [Streeter and Golovin 2008; Dey et al. 2012], SCP employs only a single
no-regret online learning routine in the inner loop. The sample function samples the online
learner’s internal distribution over items in $\mathcal{A}$ to output an item $a$. The update function
takes in a loss $l_t$ and updates the internal distribution over items.

Algorithm 10 SCP: Algorithm for training in context-free setting
Input: List length $N$,
    List length of best list to compete against $K$,
    No-regret online learner routine “update”,
    Library of items $\mathcal{A}$
Output: Learnt internal distribution over items $p : \mathcal{A} \to [0, 1]$
    for $t = 1$ to $T$ do
        $S_t = \{\}$
        for $i = 1$ to $N$ do
            $a \leftarrow \text{sample}(p_t)$
            $S_t \leftarrow S_t \oplus a$
        end for
        $d \leftarrow \text{sample}(\mathcal{D})$
        for all $a \in \mathcal{A}$ do
            $r_t(a) \leftarrow \sum_{i=1}^{N} \left(1 - \frac{1}{K}\right)^{N-i} b(a \mid S_{t,i-1}, d)$
        end for
        for all $a \in \mathcal{A}$ do
            $l_t(a) \leftarrow \max_{a' \in \mathcal{A}} r_t(a') - r_t(a)$
        end for
        for all $a \in \mathcal{A}$ do
            $p_{t+1} \leftarrow \text{update}(l_t(a))$
        end for
    end for


Algorithm 11 SCP: Algorithm for inference in context-free setting
Input: List length $N$,
    Learnt internal distribution over items $p : \mathcal{A} \to [0, 1]$,
    Library of items $\mathcal{A}$
Output: $S \in \mathcal{S}$ of length $N$
    $S = \{\}$
    for $i = 1$ to $N$ do
        $a \leftarrow \text{sample}(p)$
        $S \leftarrow S \oplus a$
    end for

SCP proceeds by training over a sequence of examples $d \sim \mathcal{D}$. At each iteration SCP
queries the online learner to generate a list of N items (via sample($p_t$)) which samples from
its  internal  distribution  over  items  pt,  evaluates  a  weighted  cumulative  gain  of  each  item  on 
the  sampled  list  to  define  a  loss  related  to  each  item  and  then  uses  the  online  learner  (via 
update)  to  update  its  internal  distribution. 

During  training  we  allow  the  algorithm  to  construct  lists  of  length  N  rather  than  K.  In 
its  simplest  form,  one  may  simply  choose  N  =  K.  However  it  may  be  beneficial  to  choose  N 
differently from $K$, as is shown later in the theoretical analysis (Chapter 3.1).

Perhaps  the  most  unusual  aspect  of  Algorithm  10  is  how  the  loss  is  defined  using  the 
weighted  cumulative  gains  of  each  item: 

\[ r_t(a) = \sum_{i=1}^{N} \left(1 - \frac{1}{K}\right)^{N-i} b(a \mid S_{t,i-1}, d), \tag{3.1} \]

where $S_{t,i-1}$ denotes the first $i - 1$ items in $S_t$ and $b(a \mid S_{t,i-1}, d) = f(S_{t,i-1} \oplus a, d) - f(S_{t,i-1}, d)$.
Intuitively (3.1) represents the weighted sum of gains of item $a$ in example $d$ had we added it
at any intermediate position in $S_t$. The gains in different positions are weighed differently,
where position $i$ is adjusted by a factor $\left(1 - \frac{1}{K}\right)^{N-i}$. These weights are derived via our

theoretical  analysis  and  indicate  that  benefits  in  early  positions  should  be  more  discounted 
than  benefits  in  later  positions.  Intuitively,  this  weighting  has  the  effect  of  re-balancing  the 
benefits  so  that  each  position  contributes  more  equally  to  the  overall  loss. 
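
A minimal sketch of the context-free training loop is given below. It is an illustration under stated assumptions rather than our implementation: the “update” routine is instantiated with a simple exponential-weights learner (one of several admissible no-regret learners), the step size `eta` is arbitrary, the loss is not rescaled to $[0, 1]$, and the toy `examples` and objective `f` are hypothetical.

```python
import math
import random


def scp_context_free(items, sample_example, f, N, K, T, eta=0.5, seed=0):
    """Train a distribution over items with an exponential-weights learner."""
    rng = random.Random(seed)
    weights = {a: 1.0 for a in items}
    for _ in range(T):
        # Sample a list of N items from the current distribution p_t.
        S = rng.choices(items, weights=[weights[a] for a in items], k=N)
        d = sample_example(rng)
        # Weighted cumulative gain r_t(a): the gain of a at every position of
        # S, with position i weighted by (1 - 1/K)^(N - i).
        r = {}
        for a in items:
            total = 0.0
            for i in range(1, N + 1):
                prefix = S[: i - 1]
                gain = f(prefix + [a], d) - f(prefix, d)
                total += (1.0 - 1.0 / K) ** (N - i) * gain
            r[a] = total
        best = max(r.values())
        # Loss l_t(a) = max_a' r_t(a') - r_t(a), then exponential-weights update.
        for a in items:
            weights[a] *= math.exp(-eta * (best - r[a]))
        # Renormalise so the weights remain a probability distribution.
        z = sum(weights.values())
        for a in items:
            weights[a] /= z
    return weights


# Hypothetical toy problem: each example is the set of items that succeed on it.
examples = [{"a"}, {"a", "b"}, {"c"}]


def f(S, d):
    """1.0 if any item in the list S succeeds on example d, else 0.0."""
    return 1.0 if any(item in d for item in S) else 0.0


p = scp_context_free(["a", "b", "c"], lambda rng: rng.choice(examples), f,
                     N=2, K=2, T=200)
print(p)
```

At inference time a list is produced by drawing $N$ samples from the returned distribution, as in Algorithm 11.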

In principle SCP and ConSeqOpt can be applied in partial feedback settings (e.g.
advertisement placement), where the value of $f$ is only observed for some items, by using
bandit algorithms instead (e.g. EXP3 [Auer et al. 2003]). As this is an orthogonal issue, we
will  focus  here  on  the  full  information  feedback  case. 

We now consider the contextual setting where features $v$ of each example $d$ are observed
before choosing the list. Consider a hypothesis space $\Pi$ which has uncountably many hypotheses.
Conceivably the Greedy algorithm can consider each such hypothesis and come
up with the best ordering of these hypotheses (up to known approximation bounds). Since
this is not feasible to do in practice due to the uncountably many hypotheses contained in
$\Pi$, in Chapter 2.2 we leveraged optimization based techniques to modify Greedy so that it
could be “lifted” to the space of classifiers or regressors (hypotheses). The resulting algorithm
ConSeqOpt could thus successfully compete against the approximately best list of
hypotheses that the Greedy algorithm could find if it could consider the uncountably many
hypotheses in $\Pi$. Similarly our goal here is to compete against the best list of hypotheses
$(\psi_1, \psi_2, \ldots, \psi_N)$ from a hypothesis class $\Pi$. Each of these hypotheses is assumed to choose
an item solely based on features of the example $d \sim \mathcal{D}$.

We embed $\Pi$ within a larger class $\tilde{\Pi} \supseteq \Pi$, where hypotheses in $\tilde{\Pi}$ are functions of both
the example and a partially constructed list ($\tilde{\Pi} : \mathbb{R}^L \times \mathcal{S} \to \mathcal{A}$, where $L$ is the size of the feature
vector $v$ representing example $d$ and $\mathcal{S}$ is the space of all possible lists of items). Then any
$\psi \in \tilde{\Pi}$, $\psi(v, S)$, selects an item to append to $S$, given features of $d$ and features of list $S$.
We will learn a hypothesis (or distribution of hypotheses) from $\tilde{\Pi}$ that attempts to generalize
list construction across multiple positions of the list.¹

Algorithm  12  details  an  extension  of  SCP  to  the  contextual  setting.  At  each  iteration, 
SCP  constructs  a  list  St  for  the  example  d,  using  its  current  hypothesis  or  by  sampling  from 
its current distribution over hypotheses. Analogous to the context-free setting, we define a
loss function for the learner subroutine “update”. We represent the loss using weighted
cost-sensitive classification examples $\{(v_{ti}, c_{ti}, w_{ti})\}_{i=1}^{N}$, where $v_{ti}$ denotes features of the example
$d$ and list $S_{t,i-1}$, $w_{ti} = \left(1 - \frac{1}{K}\right)^{N-i}$ is the weight associated to this example, and $c_{ti}$ is the
cost vector specifying the cost of each item $a \in \mathcal{A}$:

\[ c_{ti}(a) = \max_{a' \in \mathcal{A}} b(a' \mid S_{t,i-1}, d) - b(a \mid S_{t,i-1}, d). \tag{3.2} \]

The loss incurred by any hypothesis $\psi$ is defined by its loss on this set of cost-sensitive
classification examples, i.e.

\[ \ell_t(\psi) = \sum_{i=1}^{N} w_{ti}\, c_{ti}(\psi(v_{ti})). \tag{3.3} \]

These new examples are then used to update the hypothesis (or distribution over hypotheses)
using a no-regret online algorithm “update”.

This  reduction  effectively  transforms  the  task  of  learning  a  hypothesis  for  this  submodular 

¹ Competing against the best list of hypotheses in $\tilde{\Pi}$ is difficult in general as it violates submodularity:
hypotheses can perform better when added later in the list (due to list features). Nevertheless, we can still
learn from $\tilde{\Pi}$ and compete against the best list of hypotheses in $\Pi$.
Algorithm 12 SCP: Algorithm for training in contextual setting
Input: List length $N$,
       List length of best list to compete against $K$,
       Contextual no-regret online learning routine “update”,
       Hypothesis class $\tilde{\Pi} : \mathbb{R}^L \times \mathcal{S} \to \mathcal{A}$,
       Library of items $\mathcal{A}$
Output: Hypothesis (or distribution over hypotheses) $\psi : \mathbb{R}^L \times \mathcal{S} \to \mathcal{A}$
 1: for $t = 1$ to $T$ do
 2:   $S_t \leftarrow \{\}$
 3:   $d \leftarrow$ sample($\mathcal{D}$)
 4:   for $i = 1$ to $N$ do
 5:     $v_{ti} \leftarrow$ computeFeatures($S_{t,i-1}$, $d$)
 6:     $S_t \leftarrow S_t \oplus \psi_t(v_{ti})$
 7:     $c_{ti} \leftarrow [\,]$
 8:     for $j = 1$ to $|\mathcal{A}|$ do
 9:       $c_{tij} \leftarrow \max_{a' \in \mathcal{A}} b(a' \mid S_{t,i-1}, d) - b(a_j \mid S_{t,i-1}, d)$
10:       $c_{ti} \leftarrow c_{ti} \oplus c_{tij}$
11:     end for
12:     $w_{ti} \leftarrow (1 - 1/K)^{N-i}$
13:   end for
14:   $\psi_{t+1} \leftarrow$ update($\psi_t$, $\{(v_{ti}, c_{ti}, w_{ti})\}_{i=1}^{N}$)
15: end for


Algorithm 13 SCP: Algorithm for inference in contextual setting
Input: List length $N$,
       Library of items $\mathcal{A}$,
       Hypothesis (or distribution over hypotheses) $\psi : \mathbb{R}^L \times \mathcal{S} \to \mathcal{A}$
Output: List of selected items $S$
 1: $S \leftarrow \{\}$
 2: for $i = 1$ to $N$ do
 3:   $v \leftarrow$ computeFeatures($S$, $d$)
 4:   $a \leftarrow \psi(v)$
 5:   $S \leftarrow S \oplus a$
 6: end for
list optimization problem into a standard cost-sensitive classification problem.² Analogous
to the context-free setting, we can also extend to the partial feedback setting where $f$ is only
partially measurable by using contextual bandit algorithms like EXP4 [Auer et al. 2003] as
the online learner. Having transformed our problem into online cost-sensitive classification,
we now present approaches that can be used to achieve no-regret on such tasks.
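The reduction can be illustrated with a small Python sketch of the inner loop of Algorithm 12 and the inference loop of Algorithm 13. This is a minimal illustration under stated assumptions, not the implementation used in this work: the coverage-style benefit `toy_benefit`, all helper names, and the representation of a hypothesis as a plain function of the partial list and the example (instead of a feature vector) are hypothetical, and the online “update” step is omitted.

```python
def toy_benefit(item, prefix, example):
    """Hypothetical marginal benefit b(a | S, d): number of elements that
    `item` covers in `example` beyond those covered by the list `prefix`."""
    covered = set().union(*(example[a] for a in prefix))
    return len(example[item] - covered)


def scp_training_examples(example, items, hypothesis, benefit, N, K):
    """Inner loop of Algorithm 12 for one example d.

    Builds the list with the current hypothesis and returns it together
    with the weighted cost-sensitive examples (prefix, costs, weight)."""
    S, examples = [], []
    for i in range(1, N + 1):
        prefix = list(S)                                  # S_{t,i-1}
        gains = {a: benefit(a, prefix, example) for a in items}
        best = max(gains.values())
        costs = {a: best - g for a, g in gains.items()}   # cost vector, Eq. (3.2)
        weight = (1.0 - 1.0 / K) ** (N - i)               # weight w_{ti}
        examples.append((prefix, costs, weight))
        S.append(hypothesis(prefix, example))             # extend the list
    return S, examples


def scp_loss(hypothesis, examples, example):
    """Weighted cost-sensitive loss of a hypothesis on the collected examples."""
    return sum(w * costs[hypothesis(prefix, example)]
               for prefix, costs, w in examples)


def scp_predict(example, hypothesis, N):
    """Algorithm 13: build a list of N items with a fixed hypothesis."""
    S = []
    for _ in range(N):
        S.append(hypothesis(list(S), example))
    return S


# Toy usage: three items covering elements of a small ground set.
example = {"a": {1, 2}, "b": {2, 3}, "c": {4}}
items = ["a", "b", "c"]
always_a = lambda prefix, d: "a"
greedy = lambda prefix, d: max(items, key=lambda a: toy_benefit(a, prefix, d))

S, examples = scp_training_examples(example, items, always_a, toy_benefit, N=2, K=2)
print(S, scp_loss(always_a, examples, example), scp_loss(greedy, examples, example))
```

In this toy run the constant hypothesis is penalized at the second position, where repeating item "a" has zero marginal benefit, while a hypothesis that always picks a maximum-benefit item incurs zero loss on the same examples.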

For finite hypothesis classes $\tilde{\Pi}$, one can leverage any no-regret online algorithm such as
Weighted Majority [Littlestone and Warmuth 1994]. Weighted Majority maintains a
distribution over hypotheses in $\tilde{\Pi}$ based on the loss $l_t(\psi)$ of each $\psi$ and achieves regret that
scales as $O(\sqrt{K'})$ in $K' = \min(N, K)$. In fact, the context-free setting can be seen as a special
case where $\Pi = \tilde{\Pi} = \{\psi_a \mid a \in \mathcal{A}\}$ and $\psi_a(v) = a$ for any $v$. However, achieving no-regret for
uncountable hypothesis classes is in general not tractable. A more practical approach is to employ
existing reductions of cost-sensitive classification problems to convex optimization problems,
for which we can efficiently run no-regret convex optimization (e.g. gradient descent). These
reductions effectively upper bound the cost-sensitive loss by a convex loss, and thus bound
the original loss of the list prediction problem. We briefly describe two such reductions from
[Beygelzimer et al. 2005].

Reduction to Regression We transform cost-sensitive classification into a regression problem
of predicting the cost of each item $a \in \mathcal{A}$. Afterwards, we choose the item with the lowest
predicted cost. Analogous to ConSeqOpt using regressors in Algorithm 4, we convert each
weighted cost-sensitive example $(v_{ti}, c_{ti}, w_{ti})$ into $|\mathcal{A}|$ weighted regression examples. For
example, if we use least-squares linear regression, the weighted squared loss for a particular
example $(v_{ti}, c_{ti}, w_{ti})$ and regressor $\mathcal{R}$ would be

$$l(\mathcal{R}) = w_{ti} \sum_{a \in \mathcal{A}} \left(\mathcal{R}^\top v_{ti}(a) - c_{ti}(a)\right)^2. \tag{3.5}$$

Reduction to Ranking Another useful reduction transforms the problem into a “ranking”
problem that penalizes ranking an item $a$ above a better item $a'$. In our experiments, we
employ a weighted hinge loss. The penalty is therefore proportional to the difference in cost of
the mis-ranked pair. For each cost-sensitive example $(v_{ti}, c_{ti}, w_{ti})$ we generate $|\mathcal{A}|(|\mathcal{A}| - 1)/2$
ranking examples, one for every distinct pair of items $(a, a')$, where we must predict the best item
between $(a, a')$ (potentially by a margin) with a weight $w_{ti} |c_{ti}(a) - c_{ti}(a')|$. For example, if
we train a linear SVM [Joachims 2005], we obtain a weighted hinge loss of the form:

$$w_{ti} \, |c_{ti}(a) - c_{ti}(a')| \, \max\left(0,\; 1 - h^\top \left(v_{ti}(a) - v_{ti}(a')\right) \operatorname{sign}\left(c_{ti}(a') - c_{ti}(a)\right)\right), \tag{3.6}$$

where $h$ is the linear hypothesis. At prediction time, we simply predict the item
$a^* = \operatorname{argmax}_{a \in \mathcal{A}} h^\top v_{ti}(a)$. This reduction proves advantageous whenever it is easier to predict
pairwise rankings rather than the actual cost.

² This is similar to DAGGER [Ross et al. 2011], developed for sequential prediction problems like imitation
learning. SCP can be seen as a specialization of DAGGER for submodular list optimization and ensures that we
learn hypotheses that pick good items under the lists they construct.
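To make the two surrogate losses concrete, the following Python sketch evaluates a weighted squared loss and a weighted pairwise hinge loss on a single cost-sensitive example. It is only a sketch: the function names and the list-based feature layout (`features[a]` is the feature vector of item `a`) are assumptions made for illustration, and the sign convention assumes that a higher score $h^\top v(a)$ means a better (lower cost) item, consistent with predicting by argmax.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def regression_loss(r, features, costs, weight):
    """Weighted squared loss for one cost-sensitive example.
    features[a] is the feature vector of item a, costs[a] its cost."""
    return weight * sum((dot(r, features[a]) - costs[a]) ** 2
                        for a in range(len(costs)))


def ranking_hinge_loss(h, features, costs, weight):
    """Weighted pairwise hinge loss over all distinct pairs of items."""
    total = 0.0
    n = len(costs)
    for a in range(n):
        for b in range(a + 1, n):
            gap = abs(costs[a] - costs[b])
            if gap == 0:
                continue                                  # equal costs carry zero weight
            sign = 1.0 if costs[b] > costs[a] else -1.0   # +1 if item a is cheaper
            diff = [x - y for x, y in zip(features[a], features[b])]
            total += weight * gap * max(0.0, 1.0 - sign * dot(h, diff))
    return total


def predict_by_ranking(h, features):
    """Pick the item with the highest score h^T v(a)."""
    return max(range(len(features)), key=lambda a: dot(h, features[a]))


# Toy usage: three items with one-hot features and increasing costs.
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
costs = [0.0, 1.0, 2.0]
print(regression_loss([0.0, 0.0, 0.0], features, costs, 0.5))
print(ranking_hinge_loss([2.0, 1.0, 0.0], features, costs, 0.5))
```

A hypothesis that scores the items in the correct order with a margin of at least one incurs zero hinge loss, even though its scores are not the costs themselves; this is the sense in which ranking can be easier than regression.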


3.1  SCP  Analysis 

We now show that Algorithm 10 is no-regret with respect to the Greedy algorithm’s expected
performance over the training instances. Our main theoretical result provides a reduction
to an online learning problem and directly relates the performance of our algorithm on the
submodular list optimization problem to the standard online learning regret incurred by the
routine. Although Algorithm 10 uses only a single instance of an online learner routine,
it achieves the same performance guarantee as prior work [Streeter and Golovin 2008] and
ConSeqOpt, which employ $N$ separate instances of an online learner. This leads to a surprising
fact: it is possible to sample from a stationary distribution over items to construct a list that
achieves the same guarantee as the Greedy algorithm!

For a sequence of training examples $\{d_t\}_{t=1}^{T}$, let the sequence of loss functions $\{l_t\}_{t=1}^{T}$
defined in Algorithm 10 correspond to the sequence of losses incurred in the reduction to the
online learning problem. The expected regret of the online learning algorithm is:

$$\mathbb{E}[R] = \sum_{t=1}^{T} \mathbb{E}_{a' \sim p_t}[l_t(a')] - \min_{a \in \mathcal{A}} \sum_{t=1}^{T} l_t(a), \tag{3.7}$$

where $p_t$ is the internal distribution of the online learner used to construct list $S_t$. Note
that an online learner is called no-regret if $R$ is sublinear in $T$.

Let $F(p, N) = \mathbb{E}_{S_N \sim p}[\mathbb{E}_{d \sim \mathcal{D}}[f(S_N, d)]]$ denote the expected value of constructing lists by
sampling (with replacement) $N$ elements from distribution $p$, and let
$\hat{p} = \operatorname{argmax}_{t \in \{1, 2, \ldots, T\}} F(p_t, N)$ denote the best distribution found by the algorithm. We define a
mixture distribution $\bar{p}$ over lists that constructs a list as follows: sample an index $t$ uniformly
in $\{1, 2, \ldots, T\}$, then sample $N$ elements (with replacement) from $p_t$. Note that
$F(\bar{p}, N) = \frac{1}{T} \sum_{t=1}^{T} F(p_t, N)$ and $F(\hat{p}, N) \geq F(\bar{p}, N)$. Thus it suffices to show that $F(\bar{p}, N)$ has good
guarantees. We show that in expectation $\bar{p}$ (and thus $\hat{p}$) constructs lists with performance
guarantees close to Greedy.³
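The mixture construction can be sketched in a few lines of Python. The names below are hypothetical and the per-round distributions are toy values; the sketch only shows the sampling scheme (uniform index, then $N$ draws with replacement) and the fact that the value of the mixture is the average of the per-round values, so the best round is at least as good as the mixture.

```python
import random


def sample_list_from_mixture(distributions, items, N, rng):
    """Draw an index t uniformly over the T learned distributions, then
    draw N items (with replacement) from the chosen distribution p_t."""
    p_t = rng.choice(distributions)
    return rng.choices(items, weights=p_t, k=N)


def mixture_value(per_round_values):
    """Value of the mixture as the average of the per-round values F(p_t, N)."""
    return sum(per_round_values) / len(per_round_values)


# Toy usage: two rounds, each with a degenerate distribution over three items.
items = ["a", "b", "c"]
distributions = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
sampled = sample_list_from_mixture(distributions, items, N=4, rng=random.Random(0))
values = [0.25, 0.75]   # hypothetical values of F(p_t, N) for the two rounds
print(sampled, mixture_value(values), max(values))
```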


Theorem 3. Let $\alpha = \exp(-N/K)$ and $K' = \min(N, K)$. For any $\delta \in (0, 1)$, with probability
$\geq 1 - \delta$:

$$F(\bar{p}, N) \geq (1 - \alpha) F(S^*_K) - \frac{\mathbb{E}[R]}{T} - 3\sqrt{\frac{2K' \ln(2/\delta)}{T}}, \tag{3.8}$$

where $S^*_K$ is the optimal list of length $K$.

Proof. See Appendix A and A.2. □


Corollary 1. If a no-regret algorithm is used on the sequence of losses $l_t$, then as $T \to \infty$,
$\mathbb{E}[R]/T \to 0$ and

$$\lim_{T \to \infty} F(\bar{p}, N) \geq (1 - \alpha) F(S^*_K). \tag{3.9}$$

Proof. See Appendix A and A.2. □

Theorem 3 provides a general approximation ratio to the best list of size $K$ when constructing
a list of a different size $N$. For $N = K$, we obtain the typical $(1 - 1/e)$ approximation
ratio [Feige 1998]. As $N$ increases, this provides approximation ratios that converge
exponentially to 1. Naively, one might expect the regret $\mathbb{E}[R]/T$ to scale linearly in $K'$ as it involves
losses in $[0, K']$. However, we show that the regret actually scales as $O(\sqrt{K'})$ (e.g. using Weighted
Majority [Littlestone and Warmuth 1994]). Our result matches the best known results for
this setting [Streeter and Golovin 2008] while using a single online learner, and is especially
beneficial in the contextual setting due to improved generalization.
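As a quick numerical illustration of the approximation factor, the short Python sketch below evaluates $1 - \alpha$ with $\alpha = \exp(-N/K)$ for a few list lengths. The function name is hypothetical; the values of $N$ and $K$ are arbitrary and only meant to show how the factor approaches 1 as the constructed list grows relative to the list it competes against.

```python
import math


def approximation_ratio(N, K):
    """Factor (1 - alpha) with alpha = exp(-N / K)."""
    return 1.0 - math.exp(-N / K)


# Ratio against the best list of length K = 5 for growing list length N.
for N in (1, 5, 10, 20):
    print(N, round(approximation_ratio(N, K=5), 4))
```

For $N = K$ the factor is $1 - 1/e$, and doubling $N$ squares the remaining gap $\alpha$.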

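Corollary 2. Using Weighted Majority [Littlestone and Warmuth 1994] with the optimal
learning rate guarantees, with probability $\geq 1 - \delta$:

$$F(\bar{p}, N) \geq (1 - \alpha) F(S^*_K) - O\left(\sqrt{\frac{K' \ln(1/\delta)}{T}} + \sqrt{\frac{K' \ln|\mathcal{A}|}{T}}\right). \tag{3.10}$$

Proof. See Appendix A and A.2. □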


We now present performance guarantees for SCP in the contextual setting that relate
performance on the submodular list optimization task to the regret of the corresponding
online cost-sensitive classification task. Let $l_t : \tilde{\Pi} \to \mathbb{R}$ compute the loss of each hypothesis
$\psi$ on the cost-sensitive classification examples $\{(v_{ti}, c_{ti}, w_{ti})\}_{i=1}^{N}$ collected in Algorithm 12 for
example $d_t$. We use $\{l_t\}_{t=1}^{T}$ as the sequence of losses for the online learning problem.

³ Additionally, if the distributions $p_t$ converge, then the last distribution $p_{T+1}$ must have performance
arbitrarily close to $\bar{p}$ as $T \to \infty$. In particular, we can expect this to occur when the examples are randomly
drawn from a fixed distribution that does not change over time.
For a deterministic online algorithm that picks the sequence of hypotheses $\{\psi_t\}_{t=1}^{T}$, the
regret is:

$$R = \sum_{t=1}^{T} l_t(\psi_t) - \min_{\psi \in \tilde{\Pi}} \sum_{t=1}^{T} l_t(\psi). \tag{3.11}$$

For a randomized online learner, let $\psi_t$ be the distribution over hypotheses at iteration
$t$, with expected regret:

$$\mathbb{E}[R] = \sum_{t=1}^{T} \mathbb{E}_{\psi'_t \sim \psi_t}[l_t(\psi'_t)] - \min_{\psi \in \tilde{\Pi}} \sum_{t=1}^{T} l_t(\psi). \tag{3.12}$$

Let:

$$F(\psi, N) = \mathbb{E}_{S_{\psi,N} \sim \psi}[\mathbb{E}_{d \sim \mathcal{D}}[f(S_{\psi,N}, d)]] \tag{3.13}$$

denote the expected value of constructing lists $S_{\psi,N}$ by sampling (with replacement) $N$
hypotheses from the hypothesis distribution $\psi$ (if $\psi$ is a deterministic hypothesis, then this means
we use the same hypothesis at each position of the list). Let $\hat{\psi} = \operatorname{argmax}_{t \in \{1, 2, \ldots, T\}} F(\psi_t, N)$
denote the best distribution found by the algorithm in hindsight.

We use a mixture distribution $\bar{\psi}$ over hypotheses to construct a list as follows: sample an
index $t$ uniformly in $\{1, 2, \ldots, T\}$, then sample $N$ hypotheses from $\psi_t$ to construct the list. As
before, we note that $F(\bar{\psi}, N) = \frac{1}{T} \sum_{t=1}^{T} F(\psi_t, N)$ and $F(\hat{\psi}, N) \geq F(\bar{\psi}, N)$. We again focus
on providing good guarantees for $F(\bar{\psi}, N)$, as shown by the following theorem:

Theorem 4. Let $\alpha = \exp(-N/K)$ and $K' = \min(N, K)$. For any $\delta \in (0, 1)$, after $T$ iterations,
for deterministic online algorithms, with probability $\geq 1 - \delta$:

$$F(\bar{\psi}, N) \geq (1 - \alpha) F(S_{\psi^*,K}) - \frac{R}{T} - 2\sqrt{\frac{2\ln(1/\delta)}{T}}, \tag{3.14}$$

where $S_{\psi^*,K}$ is the list of length $K$ that can be constructed by the best distribution $\psi^*$ over
hypotheses. Similarly, for randomized online algorithms, with probability at least $1 - \delta$:

$$F(\bar{\psi}, N) \geq (1 - \alpha) F(S_{\psi^*,K}) - \frac{\mathbb{E}[R]}{T} - 3\sqrt{\frac{2K' \ln(2/\delta)}{T}}. \tag{3.15}$$

Proof. See Appendix A and A.2. □


Thus, as in the context-free case, a no-regret online algorithm must achieve
$F(\bar{\psi}, N) \geq (1 - \alpha) F(S_{\psi^*,K})$ with high probability as $T \to \infty$. This matches similar guarantees provided by
ConSeqOpt. Despite having similar guarantees, we intuitively expect SCP to outperform
ConSeqOpt in practice because SCP can use all data to train a single predictor, instead
of the data being split to train $N$ separate ones. We empirically verify this intuition in Chapter 3.2.
When using surrogate convex loss functions (such as regression or ranking loss), we provide
a general result that applies if the online learner uses any convex upper bound of the
cost-sensitive loss. An extra penalty term is introduced that captures the gap between the convex
upper bound and the original cost-sensitive loss:

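Theorem 5. Let $\alpha = \exp(-N/K)$ and $K' = \min(N, K)$. If we run an online algorithm on the
sequence of convex losses $C_t$ instead of $l_t$, then after $T$ iterations, for any $\delta \in (0, 1)$, we have
that with probability at least $1 - \delta$:

$$F(\bar{\psi}, N) \geq (1 - \alpha) F(S_{\psi^*,K}) - \frac{\tilde{R}}{T} - 2\sqrt{\frac{2\ln(1/\delta)}{T}} - \mathcal{G}, \tag{3.16}$$

where $\tilde{R}$ is the regret on the sequence of convex losses $C_t$ and $\mathcal{G}$ is the gap between the
convex upper bound and the original cost-sensitive loss.

Proof. See Appendix A and A.2. □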

This result implies that using a good surrogate convex loss for no-regret convex optimization
will lead to a learner (or distribution of learners) that has good performance relative
to the optimal list of learners. Note that the gap $\mathcal{G}$ may often be small or non-existent.
For instance, in the case of the reduction to regression or ranking, $\mathcal{G} = 0$ in realizable
settings where there exists a “perfect” hypothesis in the class. Similarly, where the problem is
near-realizable we would expect $\mathcal{G}$ to be small.⁴


3.2  Case  Studies 

3.2.1  Case  Study:  Robotic  Manipulation  Planning 

We  applied  SCP  to  the  robot  manipulation  planning  task  used  to  showcase  ConSeqOpt  in 
Chapter  2.1.  The  goal  is  to  predict  a  set  of  initial  trajectories  so  as  to  maximize  the  chance 
that  one  of  them  leads  to  a  collision-free  trajectory.  We  use  local  trajectory  optimization 
techniques  such  as  CHOMP  [Zucker  et  al.  2013],  which  have  proven  effective  in  quickly  finding 
collision-free trajectories using local perturbations of an initial trajectory. Note that selecting
a diverse set of initial trajectories is important since local techniques such as CHOMP often
get stuck in local optima.⁵

We  use  the  same  dataset  as  used  for  ConSeqOpt.  It  consists  of  310  training  and  212 
test  environments  of  random  obstacle  configurations  around  a  target  object,  and  30  initial 
seed  trajectories.  In  each  environment,  each  seed  trajectory  has  17  features  describing  the 

⁴ We conjecture that this gap term $\mathcal{G}$ is not specific to our particular scenario, but rather is (implicitly)
always present whenever one attempts to optimize classification accuracy via surrogate convex optimization.

⁵ I.e. similar or redundant initial trajectories will lead to the same local optima.
Figure 3.1: (a) SCP performs better even at low data availability, while ConSeqOpt suffers
from data starvation issues. (b) As the number of slots increases, SCP predicts news articles
with a lower probability of the user not clicking on any of them than ConSeqOpt. (c)
ROUGE-1R scores with respect to the size of the training data.


spatial  properties  of  the  trajectory  relative  to  obstacles.  In  addition  to  the  base  features, 
we  add  features  of  the  current  list  with  respect  to  each  initial  trajectory.  We  use  the  per 
feature  minimum  absolute  distance  and  average  absolute  value  of  the  distance  to  the  features 
of  initial  trajectories  in  the  list.  We  also  use  a  bias  feature  always  set  to  1,  and  an  indicator 
feature  which  is  1  when  selecting  the  element  in  the  first  position,  0  otherwise.  This  enables 
a  distinction  between  the  case  where  the  minimum  and  average  features  are  0  because  there 
are  no  trajectories  in  the  list  yet,  versus  when  they  are  0  because  we  are  actually  considering 
a  trajectory  which  is  already  in  the  list. 
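The list features described above can be sketched as follows. This Python snippet is an illustration only: the function name, the ordering of the feature blocks, and the use of plain lists for feature vectors are assumptions, not the layout used in the experiments.

```python
def list_features(candidate, selected):
    """Features of a candidate seed trajectory given the current list.

    candidate: base feature vector of the candidate trajectory.
    selected:  base feature vectors of the trajectories already in the list.
    Returns base features, per-feature minimum and average absolute
    distance to the list, a bias term, and a first-position indicator."""
    if selected:
        diffs = [[abs(c - s) for c, s in zip(candidate, row)] for row in selected]
        min_dist = [min(col) for col in zip(*diffs)]
        avg_dist = [sum(col) / len(col) for col in zip(*diffs)]
    else:
        # Empty list: distance features default to 0; the indicator below
        # distinguishes this case from a candidate that is already in the list.
        min_dist = [0.0] * len(candidate)
        avg_dist = [0.0] * len(candidate)
    first_position = 1.0 if not selected else 0.0
    return list(candidate) + min_dist + avg_dist + [1.0, first_position]


# Toy usage with two base features per trajectory.
print(list_features([1.0, 2.0], []))
print(list_features([1.0, 2.0], [[1.0, 2.0], [3.0, 0.0]]))
```

In the second call the minimum-distance features are zero because the candidate is already in the list, while the first-position indicator is also zero, which is exactly the distinction the indicator feature is meant to provide.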

We  compare  SCP  to  ConSeqOpt  (which  learns  N  separate  predictors),  and  Regression 
(regress  success  rate  from  features  to  sort  initial  trajectories;  this  accounts  for  relevance  but 
not  diversity). 

Figure 3.1 (left) shows the failure probability over the test environments versus the
number of training environments. ConSeqOpt employs a reduction to $N$ classifiers. As a
consequence, ConSeqOpt faces data starvation issues for small training sizes, as there is
little data available for training predictors lower in the list.⁶ In contrast, SCP has no data
starvation issue and outperforms both ConSeqOpt and Regression.


3.2.2  Case  Study:  Personalized  News  Recommendation 

In  the  news  recommendation  setting  the  task  is  to  present  a  sequence  of  news  articles  to  a 
user  so  as  to  maximize  the  probability  of  the  user  clicking  on  at  least  1  recommended  article, 
which  is  similar  to  the  initial  trajectory  selection  problem  in  manipulation. 

⁶ When a successful initial trajectory is found, benefits at later positions are 0. This effectively discards
training environments for training classifiers lower in the list in ConSeqOpt.
We built a stochastic user simulation based on 75 user preferences derived from a user
study in [Yue and Guestrin 2011]. Using this simulation as a training oracle, our goal is to
learn to recommend articles to any user (depending on their contextual features) to minimize
the failure case where the user does not like any of the recommendations.⁷

Articles  are  represented  by  features,  and  user  preferences  by  linear  weights.  We  derived 
user  contexts  by  soft-clustering  users  into  groups,  and  using  corrupted  group  memberships 
as  contexts. 

We  perform  five-fold  cross  validation.  In  each  fold,  we  train  SCP  and  ConSeqOpt  on 
40  users’  preferences,  use  20  users  for  validation,  and  then  test  on  the  held-out  15  users. 
Training, validation and testing are all performed via simulation. Figure 3.1 (middle) shows
the results, where we see that the recommendations made by SCP achieve a significantly lower
failure rate as the number of recommendations is increased from 1 to 5.

3.2.3  Case  Study:  Document  Summarization 


Method            ROUGE-1F        ROUGE-1P        ROUGE-1R
SubMod            37.39           36.86           37.99
DPP               38.27           37.87           38.71
ConSeqOpt         39.02 ± 0.07    39.08 ± 0.07    39.00 ± 0.12
SCP               39.15 ± 0.15    39.16 ± 0.15    39.17 ± 0.15
Greedy (Oracle)   44.92           45.14           45.24

Table 3.1: ROUGE unigram score on the DUC 2004 test set


In the extractive multi-document summarization task, the goal is to extract sentences
(with character budget $N$) to maximize coverage of human-annotated summaries.

Following the experimental setup from [Lin and Bilmes 2010] and [Kulesza and Taskar
2011], we use data from the Document Understanding Conference (DUC) 2003 and 2004 (Task
2) [Dang 2005]. Each training or test instance corresponds to a cluster of documents, and
contains approximately 10 documents belonging to the same topic and four human reference
summaries. We train on the 2003 data (30 clusters) and test on the 2004 data (50 clusters).
The budget is $N = 665$ bytes, including spaces.

We use the ROUGE [Lin 2004] unigram statistics (ROUGE-1R, ROUGE-1P, ROUGE-1F)
for performance evaluation. Our method directly attempts to optimize the ROUGE-1R
objective with respect to the reference summaries, which can be easily shown to be monotone
submodular [Lin and Bilmes 2011].

⁷ Also known as abandonment [Radlinski et al. 2008].

We aim to predict sentences that are both short and informative. Therefore we maximize
the normalized marginal benefit,

$$b'(a \mid S_{t,i-1}) = b(a \mid S_{t,i-1}) / l(a), \tag{3.17}$$

where $l(a)$ is the length of the sentence $a$.⁸ We use a reduction to ranking as described in
Chapter 3 using (3.17). While not performance-optimized, our approach takes less than 15
minutes to train.
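The effect of normalizing the marginal benefit by sentence length can be seen in a small Python sketch of budgeted greedy selection. It is a toy illustration under stated assumptions: the benefit is a simple count of newly covered reference unigrams (a stand-in for the ROUGE-1R benefit), the selection uses oracle access to that benefit instead of a learned predictor, and all names are hypothetical.

```python
def budgeted_greedy(concepts, lengths, budget):
    """Greedy selection by normalized marginal benefit b(a | S) / l(a).

    concepts[s]: set of reference unigrams covered by sentence s.
    lengths[s]:  length of sentence s in bytes.
    budget:      total length allowed for the summary."""
    chosen, covered, used = [], set(), 0
    while True:
        feasible = [s for s in sorted(concepts)
                    if s not in chosen and used + lengths[s] <= budget]
        if not feasible:
            break
        best = max(feasible, key=lambda s: len(concepts[s] - covered) / lengths[s])
        if not concepts[best] - covered:
            break                      # no remaining sentence adds coverage
        chosen.append(best)
        covered |= concepts[best]
        used += lengths[best]
    return chosen


# Toy usage: a long informative sentence competes with two short ones.
concepts = {"s1": {"a", "b", "c"}, "s2": {"a"}, "s3": {"d"}}
lengths = {"s1": 30, "s2": 5, "s3": 10}
print(budgeted_greedy(concepts, lengths, budget=40))
```

With a budget of 40 the short sentences are selected first because they cover more per byte, after which the long sentence no longer fits.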

Following [Kulesza and Taskar 2011], we consider features $x_i$ for each sentence consisting
of quality features $q_i$ and similarity features $\phi_i$ ($x_i = [q_i^\top, \phi_i^\top]^\top$). The quality features attempt
to capture the representativeness of a single sentence. Similarity features $\phi_i$ for sentence $a_i$,
computed as we construct the list $S_t$, measure a notion of distance of a proposed sentence to
sentences already included in the set.⁹

Table 3.1 shows the performance (ROUGE unigram statistics) comparing SCP with existing
algorithms. We observe that SCP outperforms existing state-of-the-art approaches, which
we denote SubMod [Lin and Bilmes 2010] and DPP [Kulesza and Taskar 2011]. “Greedy
(Oracle)” corresponds to the clairvoyant oracle which uses the Greedy algorithm (Algorithm 9) to
directly optimize the test ROUGE score and thus serves as an upper bound on this class of techniques.
Figure 3.1 (right) plots ROUGE-1R performance as a function of the size of training data,
suggesting SCP’s superior data efficiency compared to ConSeqOpt.


⁸ This results in a knapsack-constrained optimization problem. We refer the reader to [Zhou et al. 2013]
for a detailed analysis.

⁹ A variety of similarity features were considered, with the simplest being average squared distance of TF-IDF
vectors. Performance was very stable across different features. The experiments presented use three types:
1) following the idea in [Kulesza and Taskar 2011] of similarity as a volume metric, we compute the squared
volume of the parallelepiped spanned by the TF-IDF vectors of sentences in the set $S_{t,k} \cup a_i$; 2) the product
between $\det(G_{S_{t,k} \cup a_i})$ and the quality features; 3) the minimum absolute distance of quality features between
$a_i$ and each element in $S_{t,k}$.
CHAPTER 4


Multiple Output Structured Prediction

In previous chapters (Chapters 2 and 3) we detailed a method for directly predicting
sets and lists that maximize monotone submodular objectives in general settings. In this
chapter we propose extensions to the structured output setting. Such problems are ubiquitous
in computer vision, in tasks such as object recognition [Girshick et al. 2014], semantic
segmentation [Carreira and Sminchisescu 2010], tracking [Kalal et al. 2012], monocular human pose
estimation [Yang and Ramanan 2011] and point cloud classification [Munoz et al. 2010]. These
tasks  are  often  addressed  by  a  pipeline  architecture  where  each  module  of  the  pipeline  pro¬ 
duces  several  hypotheses  as  input  to  the  next  module.  Considering  multiple  options  at  each 
stage  is  good  practice  as  it  avoids  premature  commitment  to  a  single  answer  which,  if  wrong, 
can  jeopardize  the  quality  of  decisions  made  downstream  [Felzenszwalb  and  McAllester  2007; 
Viola  and  Jones  2001].  As  an  example  consider  Figure  4.1  where  multiple  predictions  are 
generated  for  a  foreground/background  segmentation  task.  We  see  that  the  prediction  with 
the highest confidence (denoted by prediction 1) can be far from the ground truth. The
principal requirement of a list is that at least one hypothesis in the list is close to the ground truth
labeling (high list recall). A characteristic of lists which achieve high recall in a small
number of hypotheses is diversity [Radlinski et al. 2008], which increases the odds of at least one
accurate prediction.

Our central insight is that diversity in a list of structured predictions need not be enforced,
but that it is an emergent property of optimizing the correct submodular recall objective.
Similar to ConSeqOpt, our procedure trains a sequence of predictors, each of which produces
a hypothesis. Consider the semantic scene labeling problem, where the task is to label every
pixel with a semantic label like “grass”, “sky”, etc. It is not beneficial to predict a labeling
in the second position of the list which differs from the first labeling in the list only by a
few pixels. Note that the desired property is to achieve high recall, while diversity is merely
a characteristic of lists that achieve this. Therefore our objective optimizes for recall and
does not explicitly enforce diversity; instead, the maximization of the submodular objective
Figure 4.1 (panels: Image, Prediction 1, Prediction 2, Prediction 3): For a given image our
method trains a small number of structured predictors in sequence. For a test image, the list
of predictors is invoked to produce multiple hypotheses. Our approach produces high recall
lists within a small number of hypotheses and can use any structured predictor available.

naturally produces diverse hypotheses. Conveniently, as summarized in Chapter 2, monotone
submodular functions can be maximized efficiently by greedily maximizing the marginal
benefit, which ensures performance within a factor of $1 - 1/e$ ($\approx 63\%$) of the optimal list of
items of a fixed length [Nemhauser et al. 1978].
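The greedy guarantee can be checked on a toy instance. The Python sketch below uses set coverage as a stand-in monotone submodular objective and compares greedy selection against a brute-force optimum; the instance and all function names are illustrative assumptions and are not taken from the experiments in this work.

```python
import math
from itertools import combinations


def coverage(sets, chosen):
    """Monotone submodular objective: number of elements covered."""
    return len(set().union(*(sets[i] for i in chosen)))


def greedy(sets, k):
    """Pick k items, each time adding the one with the largest marginal gain."""
    chosen = []
    for _ in range(k):
        rest = [i for i in sets if i not in chosen]
        chosen.append(max(rest, key=lambda i: coverage(sets, chosen + [i])))
    return chosen


def brute_force_optimum(sets, k):
    """Best achievable coverage with exactly k items (exhaustive search)."""
    return max(coverage(sets, list(c)) for c in combinations(sets, k))


# Toy instance on which greedy is suboptimal but within the guarantee.
sets = {"A": {1, 2, 3, 4}, "B": {1, 2, 5}, "C": {3, 4, 6}}
picked = greedy(sets, 2)
print(picked, coverage(sets, picked), brute_force_optimum(sets, 2))
```

Greedy commits to the largest set first and then covers five elements in total, while the best pair covers six; the ratio is still well above $1 - 1/e$.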

Making a single best prediction in structured problems is difficult due to the combinatorially
large state space that has to be considered. While a number of approaches, both
probabilistic [Lafferty et al. 2001; Kohli et al. 2013] and margin-based [Tsochantaridis et al.
2005; Taskar et al. 2003], for learning and inference in structured problems are well known,
methods for making multiple predictions in structured problem domains are relatively few
[Guzman-Rivera et al. 2012; Batra et al. 2012; Park and Ramanan 2011; Kulesza and Taskar
2010]. We develop a learning-based approach to produce a small list of structured predictions
that ensures high recall in a variety of computer vision tasks.

In  contrast  to  recent  developments  which  train  a  single  model  during  the  learning  phase 
and  modify  the  inference  procedure  to  produce  multiple  hypotheses  at  test  time  [Batra  et  al. 
2012;  Park  and  Ramanan  2011;  Kulesza  and  Taskar  2010],  our  approach  trains  separate 
predictors  during  the  learning  phase  to  produce  each  of  the  hypotheses  in  the  list.  This 
alternate  approach  has  several  advantages — the  learning  procedure  is  optimized  for  the  task 
of  producing  a  list  with  high  recall;  diversity  does  not  need  to  be  enforced  in  an  ad  hoc 
fashion  but  is  an  emergent  property  of  lists  that  maximize  our  objective;  it  is  agnostic  to 
the  inference  method  used  and  can  be  utilized  for  any  class  of  structured  predictor.  We 
empirically  demonstrate  our  approach  on  common  vision  tasks  such  as  estimating  human  pose 
from  a  single  image,  semantic  scene  segmentation  and  foreground/background  segmentation. 
The primary contributions of our approach are:

•  Our approach is a model-agnostic framework applicable to extending any structured
prediction algorithm to make multiple predictions. In any task domain for which learning
algorithms exist to generate a single best prediction, our approach can be employed for
making multiple predictions by training multiple instances.

•  Our approach is parameter-free. In contrast, current state-of-the-art approaches enforce
diversity by explicitly introducing a diversity modeling term in the objective function. The
parameters of such terms are tuned on validation data. It is not clear that artificially enforcing
diversity in such a way is the right thing to do to achieve the best performance on the task at
hand [Caruana et al. 2004; Misra et al. 2014].

•  We study the empirical performance of our approach and demonstrate state-of-the-art
results on multiple predictions for monocular pose estimation and foreground/background
segmentation on benchmark datasets.


4.1  Related  Work 

Kulesza and Taskar [Kulesza and Taskar 2011] have adapted determinantal point processes
(DPPs), a model used in particle physics, to optimize for diverse but low-error predictions. DPPs
are especially attractive because they allow for efficient, exact inference procedures and are
similar to monotone submodular optimization methods.

The related work in multiple structured prediction can be grouped into two categories:
1) The first category comprises model-dependent methods. These methods are tied to the specific
learning and inference procedure being used (e.g. S-SVM, CRF) and cannot easily be adapted
to different structured prediction methods. 2) The second category comprises model-agnostic
methods, which are not tied to the specifics of the chosen structured prediction method.

Model-dependent methods: Batra et al. [Batra et al. 2012] deal with the problem of
inferring low-error and diverse solutions from a Markov Random Field (MRF). They approach
this problem by introducing a constraint in the MRF objective function which requires that
a new solution be at least some distance away from each of the previous solutions.
The constraint is moved into the objective via a Lagrange multiplier $\lambda$ and the problem is then solved
using a supergradient algorithm. $\lambda$ is treated as a free parameter and is optimized over a
validation set. They term this approach DivMBest. Note that a single model (the MRF) is
initially learnt, and diverse solutions are obtained only at inference time by imposing
constraints on the inference procedure. Nilsson [Nilsson 1998] and Yanover and Weiss [Yanover
and Weiss 2004] propose methods that use loopy belief propagation to find the most
probable solutions at inference time in graphical models, but their methods do not try to
incorporate diversity to improve performance. Park and Ramanan [Park and Ramanan 2011]
use a modified version of standard max-product inference which aims to enforce diversity by
incorporating part-overlap constraints, which they term NBest.

In the approach proposed by Guzman-Rivera et al. [Guzman-Rivera et al. 2012], a structured
SVM [Tsochantaridis et al. 2005] (S-SVM) is trained for each position of the list. During
inference time, each S-SVM is invoked to predict a structured output. They minimize an upper
bound of the non-convex, structured hinge loss via a k-means-based initialization step and
an expectation-maximization (EM) style coordinate-descent minimization algorithm. They
term their approach Multiple Choice Learning (MCL). In more recent work, Guzman-Rivera
et al. [Guzman-Rivera et al. 2014a] explicitly add diversity to the MCL objective and
optimize a surrogate via an EM style block coordinate-descent minimization routine similar to
MCL. An extra parameter which trades off between diversity and accuracy is then tuned via
cross-validation. They term this approach Diverse Multiple Choice Learning (DivMCL).

In  comparison  to  such  model-dependent  methods,  our  proposed  method  is  model-agnostic 
and  can  use  any  structured  prediction  approach. 

Model-agnostic methods: To the best of our knowledge, the only such method is the ad
hoc boosting-like weighting scheme used in [Guzman-Rivera et al. 2012, 2014b], which we
denote henceforth as GR14. The weighting scheme of GR14 [Guzman-Rivera et al. 2014b]
has been used for the specific task of camera re-localization. This method has a free parameter
which must be tuned on validation data. In comparison, our approach is parameter free and
achieves comparable or better results on standard vision tasks.


4.2  Approach 

Structured prediction problems in machine learning and computer vision are characterized by
a multidimensional structured output space $\mathcal{Y}$, where the notion of structure varies
according to the problem. For example, in semantic scene understanding, the structure in the
output $y \in \mathcal{Y}$ refers to the fact that nearby regions in the image tend to have
correlated semantic labels. In human pose estimation from images, the location of a limb in the
image is correlated with the locations of other limbs.

One possible approach to structured prediction could be to use the well understood
approach of multi-class classification by treating each possible structured output as a label.
If this were possible, multiple low error and diverse interpretations could be directly generated
using a scheme such as ConSeqOpt [Dey et al. 2013], described in Chapter 2. However,
the challenge in such structured prediction tasks is that the number of possible output
combinations is exponential in the number of output variables. For example, for an
image with $10^4$ pixels and 21 possible labels for each pixel there are $21^{10^4}$ possible
labelings. This is also why structured prediction tasks cannot be addressed by multi-class
classification, as the number of classes is exponentially large. As a result, directly applying a
procedure such as ConSeqOpt to generate multiple interpretations is infeasible.
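As a rough check of the scale involved, the short sketch below counts the decimal digits of the number of labelings in the example above. The function name `labeling_count_digits` is an illustrative assumption; the count is computed through logarithms so the huge integer never has to be printed.

```python
import math


def labeling_count_digits(num_pixels: int, num_labels: int) -> int:
    """Number of decimal digits of num_labels ** num_pixels."""
    # digits(x) = floor(log10(x)) + 1, and log10(L ** P) = P * log10(L)
    return math.floor(num_pixels * math.log10(num_labels)) + 1


# 10^4 pixels, 21 labels per pixel, as in the example in the text
digits = labeling_count_digits(10**4, 21)
print(f"21^(10^4) has {digits} decimal digits")
```

Treating every labeling as its own class would therefore require a classifier over a label set whose size has thousands of digits.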

Our  approach  is  inspired  by  the  ideas  set  forth  in  ConSeqOpt  and  SCP.  We  define  a 
monotone  submodular  function  over  a  list  of  structured  predictors  and  show  that  a  simple 
greedy  algorithm  can  be  used  to  train  a  list  of  such  predictors  to  produce  a  set  of  structured 
predictions  with  high  recall.  More  formally  our  problem  can  be  stated  as  follows. 

Problem Statement: The goal of our approach is, given an input image $I \in \mathcal{X}$, to
produce a list of $N$ structured outputs $Y(I) = \{y_1, y_2, \ldots, y_N\}$, with each
$y_i \in \mathcal{Y}$, with low error and high recall. We formulate this as the problem of
learning a list of structured predictors $S = \{H_1, H_2, \ldots, H_N\}$ where each predictor
$H_i : \mathcal{X} \to \mathcal{Y}$, $H_i \in \mathcal{H}$, in the list produces the corresponding
structured output $y_i$, where $\mathcal{H}$ is a hypothesis class of structured predictors and
$\mathcal{Y}$ is the space of structured predictions.

We begin by describing a submodular objective that captures the notion of low error and
high recall. Let us denote the training samples as tuples
$\{(I^j, y^j_{gt})\}_{j \in 1 \ldots |\mathcal{D}|}$, where for each image $I^j$ the ground truth
structured label is denoted by $y^j_{gt}$. We denote by
$l : \mathcal{Y} \times \mathcal{Y} \to [0,1]$ a loss function that measures the disagreement
between the predicted structured output $y$ and the ground truth structured label $y_{gt}$. The
corresponding function measuring gain is thus given by $g(y, y_{gt}) = 1 - l(y, y_{gt})$. We
define a list of structured outputs as:

$$Y_S(I) = \{H_i(I)\}_{i \in 1 \ldots N}. \tag{4.1}$$

We then define the quality function,

$$f(Y_S(I), y_{gt}) = \max_{i \in 1 \ldots N} \{g(H_i(I), y_{gt})\}, \tag{4.2}$$

$$\phantom{f(Y_S(I), y_{gt})} = 1 - \min_{i \in 1 \ldots N} \{l(H_i(I), y_{gt})\} \tag{4.3}$$

that scores the list of structured predictions $Y_S(I)$ by the score of the best prediction
produced by the list of predictors $S = \{H_1, H_2, \ldots, H_N\}$. We note that to maximize this
scoring function with respect to the list of predictions, at least one of the predictions $y_i$
in the list needs to be close to the ground truth. In order to learn a list of predictors that
works well across a distribution of the data, the objective function we would like to optimize
is the expected value of the above function over the distribution of the data $\mathcal{D}$:

$$F(S, \mathcal{D}) = \mathbb{E}_{(I, y_{gt}) \sim \mathcal{D}} \left[ f(Y_S(I), y_{gt}) \right]. \tag{4.4}$$

The resulting optimization problem is therefore to find the list of predictors $S$ that
maximizes the objective function $F$ in Equation 4.4 and can be written as follows:

$$\max_{S} F(S, \mathcal{D}) = \max_{S} \mathbb{E}_{(I, y_{gt}) \sim \mathcal{D}} \left[ f(Y_S(I), y_{gt}) \right]. \tag{4.5}$$
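To make the quality function concrete, the following minimal sketch evaluates Equations 4.3 and 4.4 on a small table of per-example losses. The loss values and the names `list_quality` and `empirical_objective` are illustrative assumptions and do not come from the experiments in this chapter.

```python
def list_quality(losses):
    """f = 1 - min_i l_i for one example; losses[i] is the loss of predictor i."""
    return 1.0 - min(losses)


def empirical_objective(loss_table):
    """Average of f over examples; loss_table[i][j] = loss of predictor i on example j."""
    num_examples = len(loss_table[0])
    per_example = [
        list_quality([row[j] for row in loss_table]) for j in range(num_examples)
    ]
    return sum(per_example) / num_examples


# Hypothetical losses of two predictors on three examples
loss_table = [
    [0.3, 0.6, 0.1],  # predictor 1
    [0.9, 0.2, 0.5],  # predictor 2
]
F_one = empirical_objective(loss_table[:1])  # list containing only predictor 1
F_two = empirical_objective(loss_table)      # list containing both predictors
print(F_one, F_two)
```

Adding a predictor can only lower the per-example minimum loss, so the objective never decreases as the list grows, which is the monotonicity property used in the text.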

The function $F$ of the form in Equation 4.4 can be shown to be a monotone submodular
function over lists of input items as shown in Appendix B, [Dey et al. 2012]. The natural
approach for submodular optimization problems of the form in Equation 4.5 is to use a greedy
algorithm [Nemhauser et al. 1978]. In each greedy step $i$, we add the structured predictor
$H_i^*$ that maximizes the marginal benefit. For our objective, maximizing the marginal benefit
is written as:

$$H_i^* = \arg\max_{h \in \mathcal{H}} \; F(S_{i-1} \oplus \{h\}, \mathcal{D}) - F(S_{i-1}, \mathcal{D}). \tag{4.6}$$
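The greedy step in Equation 4.6 can be illustrated on a finite pool of candidates, where enumeration is possible. The sketch below is only an illustration under that assumption: the pool, its loss values, the function name `greedy_list`, and the convention that the empty list has loss 1 on every example are all hypothetical.

```python
def greedy_list(candidate_losses, list_length):
    """Greedily pick predictors from a finite pool by marginal benefit.

    candidate_losses[name][j] is the loss in [0, 1] of candidate `name` on example j.
    Returns the chosen names in the order they were added.
    """
    num_examples = len(next(iter(candidate_losses.values())))
    best_so_far = [1.0] * num_examples  # assumed loss of the empty list
    chosen = []
    for _ in range(min(list_length, len(candidate_losses))):

        def marginal_benefit(name):
            # F(S + {h}) - F(S): average reduction of the per-example minimum loss
            losses = candidate_losses[name]
            return sum(max(b - l, 0.0) for b, l in zip(best_so_far, losses)) / num_examples

        remaining = [n for n in candidate_losses if n not in chosen]
        best_name = max(remaining, key=marginal_benefit)
        chosen.append(best_name)
        best_so_far = [
            min(b, l) for b, l in zip(best_so_far, candidate_losses[best_name])
        ]
    return chosen


pool = {
    "A": [0.1, 0.8, 0.9],
    "B": [0.2, 0.7, 0.8],
    "C": [0.9, 0.1, 0.9],
}
order = greedy_list(pool, 2)
print(order)
```

In this toy pool the second pick is the candidate that covers the example the first pick handles poorly, rather than the candidate with the second-lowest average loss.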

Maximizing the marginal benefit, as written above, over the space of structured predictors by
enumeration is difficult, because there can be uncountably many such predictors. Instead, we
take the approach of directly training a structured predictor to maximize the marginal benefit.
As we do not have access to the true distribution of the data, we maximize the marginal
benefit using the empirical distribution $\mathcal{D}$. We denote the loss
$l_i^j = l(y_i^j, y^j_{gt})$ as shorthand for the loss of the $i$th predictor on the $j$th
training sample. Rewriting the objective with respect to the empirical data distribution and in
terms of the loss per example we have,

$$H_i^* = \arg\max_{h \in \mathcal{H}} \sum_{j \in \mathcal{D}} \left( \min\{l_1^j, \ldots, l_{i-1}^j\} - \min\{l_1^j, \ldots, l_i^j\} \right), \tag{4.7}$$

where $\ell^j_{i-1} = \min\{l_1^j, \ldots, l_{i-1}^j\}$ in 4.9 is the minimum loss obtained by the
list of $i-1$ predictors on the $j$th sample till now. Since training procedures for structured
predictors are usually implemented to minimize loss we rewrite 4.9 and 4.6 as:


Algorithm 14 SeqNBest: Algorithm for training
Input: List length $N$,
    Structured prediction routine $H \in \mathcal{H}$,
    Dataset $\mathcal{D}$ of $|\mathcal{D}|$ examples

$S = \{\}$, $\{w_0^j = 1\}_{j \in 1 \ldots |\mathcal{D}|}$
for $i = 1$ to $N$ do
    $H_i \leftarrow$ trainStructuredPredictor($\mathcal{D}$, $w_{i-1}$)
    $S \leftarrow S \oplus H_i$
    $w_i$ = computeMarginalWeights($S$, $\mathcal{D}$)
end for
Return: $S = \{H_1, H_2, \ldots, H_N\}$


Algorithm 15 computeMarginalWeights($S$, $\mathcal{D}$)
Input: List of trained structured predictors $S$,
    Dataset $\mathcal{D}$

for $j = 1$ to $|\mathcal{D}|$ do
    $l = \{\}$
    for $i = 1$ to $|S|$ do
        $l_i = l(H_i(I^j), y^j_{gt})$
        $l \leftarrow l \oplus l_i$
    end for
    $\ell \leftarrow \min(l)$
    if SeqNBest1 then
        $w^j \leftarrow \ell$
    else if SeqNBest2 then
        $w^j \leftarrow \ell^3 / (3\ell^2 - 3\ell + 1)$
    end if
end for
Return: $w = \{w^1, w^2, \ldots, w^{|\mathcal{D}|}\}$


Algorithm 16 SeqNBest: Algorithm for inference
Input: Trained list of classifiers $H_1, H_2, \ldots, H_N$,
    Test example $I$
Output: List of structured predictions $S$

$S = \{\}$
for $i = 1$ to $N$ do
    $y \leftarrow H_i(I)$
    $S \leftarrow S \oplus y$
end for


$$H_i^* = \arg\max_{h \in \mathcal{H}} \sum_{j \in \mathcal{D}} \max\left(\ell^j_{i-1} - l_i^j, \, 0\right), \tag{4.10}$$

$$\phantom{H_i^*} = \arg\min_{h \in \mathcal{H}} \sum_{j \in \mathcal{D}} \min\left(l_i^j - \ell^j_{i-1}, \, 0\right). \tag{4.11}$$


Let us denote the per-example desired loss
$l_{\text{Actual}} = \min\left(l_i^j - \ell^j_{i-1}, 0\right)$, which is the summand in Equation
4.11. Consider the relationship of the loss $l_{\text{Actual}}$ as a function of the loss of the
current predictor $l_i^j$ and the best loss seen before the current predictor
($\ell^j_{i-1}$). This is drawn in Figure 4.2a and denoted by the line $l_{\text{Actual}}$. We
observe that if a predictor obtains a loss greater than the previous best $\ell^j_{i-1}$ on an
example, it does not contribute towards lowering the loss defined in Equation 4.11, whereas if
it achieves a loss less than $\ell^j_{i-1}$, it lowers the objective by the amount by which it is
less than $\ell^j_{i-1}$. Optimizing such a loss directly tends to be difficult as it can require
modifications that are specific to the structured predictor’s training procedure. Instead, we
take the approach of optimizing a tight linear upper bound of the loss
($l_{\text{SeqNBest1}}$ in Figure 4.2a) which results in a procedure that only requires
re-weighting the training data and is model-agnostic. Consider a linear upper bound on
$l_{\text{Actual}}$ defined by the parameter $w^j$,

$$w^j l_i^j \geq l_{\text{Actual}}. \tag{4.12}$$

Training a predictor which optimizes the surrogate loss on the left hand side of 4.12 is
equivalent to training a structured predictor which weights each data sample with the weight
$w^j$:

$$\sum_{j \in \mathcal{D}} w^j l_i^j \geq \sum_{j \in \mathcal{D}} \min\left(l_i^j - \ell^j_{i-1}, \, 0\right). \tag{4.13}$$

Note that by setting the weight of each sample to be proportional to the marginal benefit
left ($w^j \propto \ell^j_{i-1}$) we are minimizing a tight linear upper bound of the actual loss
function we wish to minimize ($l_{\text{Actual}}$). This relationship is indicated by the line
$l_{\text{SeqNBest1}}$ in Figure 4.2b. Our training procedure might be reminiscent of boosting
[Freund et al. 1999], where several predictors are combined to produce a single output. In
contrast, our procedure is trained to produce a list of predictors, each of which makes a
separate prediction in a list of predictions. Additionally, we are also optimizing a completely
unrelated loss function.

An alternative tight linear upper bound can be calculated by minimizing the $L_2$ distance
between $l_{\text{Actual}}$ and a linear loss function given by $w^j l_i^j$. Consider a family of
linear upper bounds of the loss $l_{\text{Actual}}$ which has the form

$$l_{\text{SeqNBest2}} = a \, l_i^j + b, \tag{4.14}$$

where the slope $a$ plays the role of the weight $w^j$. Note that to achieve a tight upper bound,
candidate lines must pass through the point $(\ell^j_{i-1}, 0)$. Substituting
$(\ell^j_{i-1}, 0)$ in Equation 4.14 we get

$$0 = a \, \ell^j_{i-1} + b, \tag{4.15}$$

$$b = -a \, \ell^j_{i-1}. \tag{4.16}$$


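To obtain the tightest upper bound (in an $L_2$ sense) we minimize the $L_2$ distance between
$l_{\text{Actual}}$ and $l_{\text{SeqNBest2}}$ to obtain $a$. The $L_2$ distance between
$l_{\text{Actual}}$ and $l_{\text{SeqNBest2}}$ is

$$\Delta = \int_0^1 \left\| l_{\text{SeqNBest2}} - l_{\text{Actual}} \right\|^2 dl, \tag{4.17}$$

$$\phantom{\Delta} = \int_0^{\ell^j_{i-1}} \left[ (l - \ell^j_{i-1}) - (a l + b) \right]^2 dl \tag{4.18}$$

$$\phantom{\Delta =} + \int_{\ell^j_{i-1}}^{1} (a l + b)^2 \, dl. \tag{4.19}$$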


Differentiating the expression for $\Delta$ with respect to $a$ (the slope of the line), setting
it to 0, and using the constraint that the line must pass through $(\ell^j_{i-1}, 0)$, we get

$$a = \frac{(\ell^j_{i-1})^3}{3(\ell^j_{i-1})^2 - 3\ell^j_{i-1} + 1}. \tag{4.20}$$

This gives the optimal slope of the line $l_{\text{SeqNBest2}}$ which minimizes the gap between
it and $l_{\text{Actual}}$.
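The intermediate steps are not spelled out in the text. One way to fill them in, writing $\ell$ for $\ell^j_{i-1}$, $l$ for the loss of the current predictor, and substituting $b = -a\ell$ so that the line becomes $a(l - \ell)$, is sketched below; this is a reconstruction under those assumptions rather than the original derivation.

```latex
\begin{align*}
\Delta(a) &= \int_0^{\ell} (1-a)^2 (l-\ell)^2 \, dl + \int_{\ell}^{1} a^2 (l-\ell)^2 \, dl
           = \frac{(1-a)^2 \, \ell^3}{3} + \frac{a^2 (1-\ell)^3}{3}, \\
\frac{d\Delta}{da} &= -\frac{2(1-a)\,\ell^3}{3} + \frac{2a\,(1-\ell)^3}{3} = 0
  \;\Longrightarrow\; a \left[ \ell^3 + (1-\ell)^3 \right] = \ell^3, \\
a &= \frac{\ell^3}{\ell^3 + (1-\ell)^3} = \frac{\ell^3}{3\ell^2 - 3\ell + 1}.
\end{align*}
```

The last step uses $(1-\ell)^3 = 1 - 3\ell + 3\ell^2 - \ell^3$. Since $0 \le a \le 1$ for $\ell \in [0,1]$, the line $a(l - \ell)$ stays above $l_{\text{Actual}}$ on both sides of $\ell$.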


The graphical relationship between $\ell^j_{i-1}$ and the optimal weight (slope of the line) in
$l_{\text{SeqNBest2}}$ is shown in Figure 4.2b. We see that $l_{\text{SeqNBest1}}$ weights the
examples in direct proportion to the previous best loss, while $l_{\text{SeqNBest2}}$ tends to
aggressively upweight hard samples which have a high best previous loss
($\ell^j_{i-1} > 0.5$) and aggressively downweight easier examples which have a low best
previous loss ($\ell^j_{i-1} < 0.5$).
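The two weighting rules can be compared numerically. The sketch below simply evaluates the SeqNBest1 weight (the previous best loss itself) and the SeqNBest2 weight from Equation 4.20 on a few values; the function names are illustrative assumptions.

```python
def seqnbest1_weight(best_prev_loss: float) -> float:
    """SeqNBest1: weight equals the best loss achieved so far on the example."""
    return best_prev_loss


def seqnbest2_weight(best_prev_loss: float) -> float:
    """SeqNBest2: L2-optimal slope from Equation 4.20."""
    ell = best_prev_loss
    # The denominator equals ell^3 + (1 - ell)^3, which is at least 0.25 on [0, 1]
    return ell**3 / (3 * ell**2 - 3 * ell + 1)


for ell in (0.1, 0.25, 0.5, 0.75, 0.9):
    w1 = seqnbest1_weight(ell)
    w2 = seqnbest2_weight(ell)
    print(f"best previous loss {ell:.2f}: SeqNBest1 {w1:.3f}, SeqNBest2 {w2:.3f}")
```

The two schemes agree at a previous best loss of 0, 0.5 and 1; between 0.5 and 1 the SeqNBest2 weight is the larger of the two, and below 0.5 it is the smaller.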


(a) Upper bound relationship between the proposed surrogate loss and the desired actual loss
function. (b) Comparison between the proposed weighting schemes as a function of the best loss
seen so far.

We summarize our algorithm in Algorithm 14. We begin by assigning a weight of 1 to
each training sample in the dataset. In each iteration $i$, we train the structured predictor
$H_i$ with the dataset $\mathcal{D}$ and the associated weights $w$ for each training sample and
append it to the list of predictors $S$. We recompute the weights $w$ using the scheme described
in Algorithm 15. We iterate for the specified $N$ iterations and return the list
$S = \{H_1, \ldots, H_N\}$ of structured predictors. We term this simple but powerful approach
“Sequential N-Best” or SeqNBest.
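The training and inference loops of Algorithms 14 to 16 can be sketched in a few lines. Everything concrete below is an assumption made for illustration: the Hamming loss, the toy dataset, and in particular `train_weighted_constant_predictor`, a deliberately trivial stand-in for the structured learner that ignores its input and predicts a weighted per-position majority labeling. Any learner that accepts per-example weights could take its place.

```python
def hamming_loss(pred, truth):
    """Fraction of positions where two label sequences disagree (in [0, 1])."""
    return sum(p != t for p, t in zip(pred, truth)) / len(truth)


def train_weighted_constant_predictor(dataset, weights):
    """Hypothetical stand-in for trainStructuredPredictor.

    Returns a predictor that outputs the binary labeling minimizing the weighted
    Hamming loss over the dataset (a weighted majority vote per position).
    """
    length = len(dataset[0][1])
    labeling = tuple(
        1 if sum(w * (1 if y[pos] == 1 else -1)
                 for (_, y), w in zip(dataset, weights)) > 0 else 0
        for pos in range(length)
    )
    return lambda x: labeling


def compute_marginal_weights(predictors, dataset, scheme="SeqNBest1"):
    """Algorithm 15: weight each example by the best loss achieved so far."""
    weights = []
    for x, y_gt in dataset:
        ell = min(hamming_loss(h(x), y_gt) for h in predictors)
        if scheme == "SeqNBest1":
            weights.append(ell)
        else:  # SeqNBest2
            weights.append(ell**3 / (3 * ell**2 - 3 * ell + 1))
    return weights


def seq_n_best_train(dataset, n, train=train_weighted_constant_predictor,
                     scheme="SeqNBest1"):
    """Algorithm 14: train n predictors, re-weighting the data after each one."""
    predictors = []
    weights = [1.0] * len(dataset)  # every example starts with weight 1
    for _ in range(n):
        predictors.append(train(dataset, weights))
        weights = compute_marginal_weights(predictors, dataset, scheme)
    return predictors


def seq_n_best_predict(predictors, x):
    """Algorithm 16: one structured prediction per list position."""
    return [h(x) for h in predictors]


toy = [
    ("img0", (1, 1, 0, 0)),
    ("img1", (1, 1, 0, 0)),
    ("img2", (0, 0, 1, 1)),
]
models = seq_n_best_train(toy, n=2)
preds = seq_n_best_predict(models, "img2")
print(preds)
```

In this toy run the first predictor fits the majority labeling, after which only the example it gets wrong keeps a nonzero weight, so the second predictor is fitted to that example.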


4.2.1  An  Example 

As an example, on the task of image segmentation the required inputs are the number of
predictions $N$ we want to make per example, the training dataset $\mathcal{D}$, and the
structured prediction procedure for learning and inference $\mathcal{H}$. We illustrate the
algorithm via a toy dataset of 3 images (see Figure 4.2), where the task is to perform
foreground/background segmentation by marking each pixel with either the foreground or the
background label. Assume that we have already trained 2 predictors and are calculating the
importance (weight) of each image for the 3rd predictor. The second and third rows show the
performance of the two predictors on these images. Note that neither predictor does well on the
image of the elephant, whereas one of the predictors does really well on the helicopter. This
tells us intuitively that training the third predictor should concentrate more on the image of
the elephant, and not as much on the other two, since at least one of the previous predictors
has done relatively well on each of them.
The last row in the figure shows the weight for each image, which is the minimum of the
errors obtained by all previous predictors. This weighting rule achieves the desired behavior
of working harder on examples on which none of the previous predictors has performed well.

Figure 4.2: Illustration of the SeqNBest training procedure. Consider a toy training dataset of
3 images (chosen from the iCoseg dataset [Batra et al. 2010]), where the task is to do
foreground/background separation. The first predictor gets 30% pixel error on the bear image,
while the second predictor gets 90% pixel error. Intuitively, since the first predictor already
did well on this image, we should not try as hard on it as on the elephant image, where neither
of the 2 predictors did very well. The rule for weighting data points for training the next
predictor is the minimum of the errors of the previous predictors, and the last column shows
this rule applied to this contrived example. Note that the elephant image has the highest weight
since none of the previous predictors did well on it, while the helicopter image has the lowest
weight, since the first predictor did really well on it.
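The weighting rule in this example amounts to taking a per-image minimum. In the sketch below only the bear errors (30% and 90%) come from the caption; the elephant and helicopter errors are hypothetical values chosen to match the qualitative description.

```python
# Pixel error of (predictor 1, predictor 2) on each toy image
errors = {
    "bear": (0.30, 0.90),        # values quoted in the caption
    "elephant": (0.80, 0.70),    # hypothetical: neither predictor does well
    "helicopter": (0.05, 0.60),  # hypothetical: first predictor does really well
}

# SeqNBest1 weight for the third predictor: minimum error over previous predictors
weights = {name: min(errs) for name, errs in errors.items()}

# Images ordered from most to least important for training the third predictor
ranked = sorted(weights, key=weights.get, reverse=True)
print(weights, ranked)
```

With these numbers the elephant image receives the largest weight and the helicopter image the smallest, as described in the caption.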


4.3  Case  Studies 

We evaluate our methods against both model-dependent [Park and Ramanan 2011; Guzman-Rivera
et al. 2012; Batra et al. 2012] and model-agnostic methods [Guzman-Rivera et al.
2014b] (see Chapter 4.1). Note that the weighting scheme of GR14 [Guzman-Rivera et al.
2014b] has been used for the specific task of camera re-localization, and published results on
standardized datasets do not exist. We make a best-effort comparison by reimplementing
their method for standardized tasks.

We demonstrate that our simple yet powerful weighting scheme results in better performance
than model-dependent methods and comparable or better performance than model-agnostic
methods, with much less computation because no parameter tuning step is needed.


4.3.1 Case Study: Human Pose Tracking in Monocular Sequences

In monocular pose estimation the task is to estimate the 2D locations of anatomical landmarks
from an image. The task is challenging due to the large variation in appearance and
configuration of humans in images. Additional challenges are posed by partial occlusions,
self-occlusions, and foreshortening. A related task is to track the pose of a human subject
through a sequence of frames of video. In the tracking-by-detection paradigm of human pose
tracking, multiple hypothesis poses are generated per frame of video and then stitched together
using a data association algorithm. This avoids making a hard commitment to a single
best pose at a frame. As long as the correct pose is present amongst the multiple hypothesized
poses for each frame, the algorithm has a chance of picking the correct one using
additional temporal information.

Datasets: We evaluate our method on producing multiple predictions for each image in the
PARSE dataset used by Yang and Ramanan [Yang and Ramanan 2011] and on
the tracking datasets introduced in Park and Ramanan [Park and Ramanan 2011] named
“lola1”, “lola2”, “walkstraight” and “baseball”. We use the same model, code and training set
as Yang and Ramanan [Yang and Ramanan 2011] and use our two weighting methods to
train $N = 4$ models as detailed in Algorithm 14. We use the same test set
used by Yang and Ramanan to compare the average percentage of correct parts (PCP) of the
best pose as the number of pose hypotheses is increased from 1 to 4.

Analysis: Figure 4.3 shows that as the number of hypotheses is increased, SeqNBest1 and
SeqNBest2 find accurate poses earlier in the list than NBest. The figure plots the accuracy of
the best pose predicted, averaged across the test set, as the number of pose hypotheses is
increased. Batra et al. [Batra et al. 2012] refer to this as the “oracle” accuracy of a list of
predictions. We show results for NBest with and without non-maximum suppression
post-processing. Note that even with non-maximum suppression, NBest is unable to outperform
SeqNBest, which requires no post-processing step. We also compare against the boosting-like
weighting scheme of GR14 [Guzman-Rivera et al. 2014b]. GR14 performs marginally better than
SeqNBest, achieving 81.95% oracle accuracy compared to SeqNBest1’s 80.83% by position 4. Note
that this boosting-like weighting scheme has a free parameter which is tuned by cross-validation,
while our approach is parameter free. We used the exact same set of values of this free
parameter as used in [Guzman-Rivera et al. 2014b] to tune it for all our following experiments.
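The “oracle” accuracy curve described above can be computed from per-prediction scores as in the sketch below. The function name `oracle_accuracy_curve` and the scores are illustrative assumptions; in the experiments the per-prediction score would be the PCP of each hypothesized pose.

```python
def oracle_accuracy_curve(per_example_scores):
    """Average best-of-first-k accuracy for k = 1 .. list length.

    per_example_scores[j][k] is the accuracy of the (k+1)-th prediction in the
    list on example j (higher is better).
    """
    num_examples = len(per_example_scores)
    list_length = len(per_example_scores[0])
    curve = []
    for k in range(1, list_length + 1):
        # Best score among the first k predictions, for each example
        best = [max(scores[:k]) for scores in per_example_scores]
        curve.append(sum(best) / num_examples)
    return curve


# Hypothetical accuracies of a list of 3 predictions on 2 test examples
scores = [
    [0.50, 0.75, 0.25],
    [1.00, 0.50, 0.50],
]
curve = oracle_accuracy_curve(scores)
print(curve)
```

The curve is non-decreasing in the list length by construction, so methods are compared by how quickly it rises with the number of hypotheses.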

In Table 4.1 we compare the performance of DivMBest with respect to SeqNBest1
and SeqNBest2. Three models were trained using the two SeqNBest schemes on the
PARSE training set and then compared to the “oracle” PCPs reported by NBest and DivMBest.
In each video sequence SeqNBest1 or SeqNBest2 achieves higher recall. In the
“walkstraight” dataset SeqNBest1 achieves 98.5% PCP in 3 positions, whereas DivMBest
needs 100 predictions to reach the same accuracy. Similarly, DivMBest needs 20 predictions for
“lola1”, 7 predictions for “lola2” and 7 predictions for “baseball” to reach the
same “oracle” PCP as SeqNBest2. Note that GR14, after much tuning on validation data,
is still behind SeqNBest on all four videos.

Figure 4.3: As the number of pose hypotheses allowed is increased from one to four, SeqNBest
predicts more accurate poses compared to NBest with non-maximum suppression.
Both start out at 73.8% percentage of correct parts since the first position’s model is identical
for both, but by the 4th position SeqNBest has achieved 81.61% average best accuracy while
NBest achieves 79.37%.


4.3.2 Case Study: Image Foreground/Background Separation

We  apply  our  method  to  the  task  of  foreground/background  segmentation  where  the  task  is 
to  assign  each  pixel  in  an  image  with  either  the  foreground  or  background  label. 

Dataset: We use the set of 166 images of the iCoseg dataset [Batra et al. 2010], spanning
9 different events, as used by MCL [Guzman-Rivera et al. 2012]. The dataset is split roughly
equally into training, validation and test sets. The exact splits were provided to us by
the authors of MCL.

Figure 4.4: For each of the three images, the top row is the list of 4 pose hypotheses by NBest
while the bottom row is the list of 4 by SeqNBest. For the baseball player SeqNBest predicts the
correct pose in the 2nd guess, for the gymnast in the 3rd guess and for the cyclist in the 4th
guess. Note that in each case SeqNBest1 produces poses which are diverse from each other while
trying to be relevant to the scene. In each case NBest produces poses which are almost identical
to each other and none of which are close to the ground truth pose.

Table 4.1: Comparison of SeqNBest1 and SeqNBest2 to NBest and DivMBest. The
average best PCP is plotted against the number of predictions as the budget for generating
hypotheses is increased. In each case SeqNBest1 and/or SeqNBest2 predicts more accurate poses
for the number of hypotheses allowed.


Analysis: We compare the performance of SeqNBest to MCL in two ways: 1) We use the
exact implementation of S-SVM provided to us by the authors of MCL as the structured
predictor routine in SeqNBest to train 6 predictors. 2) To showcase the flexibility
of SeqNBest to use any structured predictor available, we use the Hierarchical Inference
Machine (HIM) algorithm by Munoz et al. [Munoz et al. 2010] to train SeqNBest. We
use texture and C-SIFT [Gould et al. 2010] as features. Figure 4.5 (left) shows the “oracle”
accuracy of a list of predictions. Additionally, we compare against GR14 [Guzman-Rivera
et al. 2014b]. We find that using the same predictor and features as in MCL, SeqNBest1 and
MCL have comparable performance in Figure 4.5 (left). When HIM is used as the structured
predictor (Figure 4.5 (right)), SeqNBest performs much better from the first position and
obtains 6% average best error in 6 predictions. The reduction of error stops after the first 3
positions because the HIM model starts approaching the theoretical limits of its performance on
the test set, which is 2% (this was obtained by training and testing HIM on the test set itself).

Figure 4.5: Average best pixel error in the image foreground/background segmentation task
as the number of predictions is increased. SeqNBest (with S-SVM) uses the same S-SVM
structured predictor routine as MCL.


In summary, variants of SeqNBest performed on par with model-dependent methods
like MCL, which have the advantage of leveraging the specifics of the chosen structured
predictor (in this case S-SVM). SeqNBest, however, is model-agnostic and can be readily
applied to any structured predictor. We find that SeqNBest used in conjunction with HIM
outperforms the other model-agnostic method, GR14, which is also trained with HIM as the
base predictor (Figure 4.5 (right)). This also serves as an example of SeqNBest’s flexibility
in being able to plug in any powerful predictor.


4.3.3  Case  Study:  Image  Segmentation 

As mentioned earlier, semantic scene segmentation is a very challenging task, where every
pixel in an image has to be assigned a semantic label like “boat”, “sky”, etc. In this section
we show initial promising results with SeqNBest. Note that these are not meant to be
competitive with the most recent state-of-the-art advances in image segmentation but are meant
to showcase the flexibility of our approach in using any predictor.

Dataset: In the PASCAL VOC 2012 segmentation challenge [Everingham et al.] the task is
to mark every pixel of every test image with one of 20 class labels or the background class.
Figure 4.6 shows some example images and their annotated ground truth labels. There are 1464
images in the train set and 1449 in the val set, which we use as the test set in our
experiments below.

Table 4.2: As the number of predictions is increased, we observe a 10.60% gain in “oracle”
accuracy over a single prediction on the PASCAL VOC 2012 val dataset.

Position: … 3 …
Oracle acc. (%): 42.91 … 45.96 …

Figure 4.6: Qualitative examples of multiple semantic scene segmentations on the PASCAL
VOC 2012 dataset. Each predictor tries to get right what the previous predictors have not
been able to cover well. For example, in the cow grazing scene the first two predictors miss
parts of the cow while the third one gets the majority of it correct.

Analysis: We use the Hierarchical Inference Machine (HIM) algorithm by Munoz et al. [Munoz
et al. 2010] to learn 5 structured predictors in the SeqNBest framework. We use the output
of the category-specific regressors of [Carreira et al. 2012] as additional features to HIM. In
the first position HIM achieves 42.91% average intersection/union accuracy over all 21 classes.
Table 4.2 shows the “oracle” accuracy as the number of predictions is increased to 5, where
the “oracle” accuracy is 47.46%, which is a 10.6% relative gain.
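The quoted gain can be checked directly from the two accuracies given in the text (42.91% at position 1 and 47.46% at position 5). The sketch below assumes that the 10.6% figure is a relative improvement over the single-prediction accuracy; the variable names are illustrative.

```python
first_position_acc = 42.91  # oracle accuracy (%) with one prediction
fifth_position_acc = 47.46  # oracle accuracy (%) with five predictions

# Gain in percentage points, and gain relative to the single-prediction accuracy
absolute_gain = fifth_position_acc - first_position_acc
relative_gain = 100.0 * absolute_gain / first_position_acc

print(f"absolute gain: {absolute_gain:.2f} points, relative gain: {relative_gain:.2f}%")
```

Under this reading the improvement is about 4.55 percentage points of intersection/union accuracy, which corresponds to the 10.6% relative gain.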

Prasad et al. [Prasad et al. 2014] have proposed inference procedures for extracting
diverse hypotheses in MRFs using various higher-order potentials [Delong et al. 2012]. This
is another example of the model-dependent category of methods described in Chapter 4.1.
Similar to us, they have demonstrated their method on the semantic segmentation challenge
on the PASCAL VOC 2012 val set. They show impressive “oracle” gains of ≈ 12% over a single
prediction. Since their model and code are not yet available, it is not currently possible to
directly compare against SeqNBest. We use a different model to achieve similar boosts.
Again, this showcases the ease of use and generality of our approach. Note that we are not
constrained to specific models or specific diversity terms which may only be compatible with
particular model representations.

In ongoing experiments we are using recent advances in convolutional neural networks
[Long et al. 2014; Hariharan et al. 2014] as the structured predictor for generating multiple
segmentations using SeqNBest.


The three main approaches presented until now for predicting sets and lists (ConSeqOpt
(Chapter 2), SCP (Chapter 3), SeqNBest (Chapter 4)) rely on “reductions” for algorithm
design. The main idea in reductions is a simple but powerful one: given a challenging problem
for which no obvious solution strategies exist, break it down into simpler problems with
well-understood theoretical and practical solutions and then relate performance on the simpler
problems to the original problem of interest. Well known reductions include: Quanting
[Langford et al. 2012] from quantile regression to classification; Probing [Langford and
Zadrozny 2005] from squared loss regression to classification; Costing [Zadrozny et al. 2003]
from importance weighted classification to binary classification by rejection sampling; Searn
[Daumé III et al. 2009] from structured prediction to binary classification; Dagger [Ross
et al. 2011] from imitation learning and structured prediction to no-regret online learning. Such
reductions leverage existing methods for classification, regression and structured prediction
and allow rapid progress to be made on the new task. Furthermore, due to their modular
nature, if improved techniques for classification, regression or structured prediction become
available in the future, they can be readily plugged in without any change in the algorithm.
This makes the approaches proposed in this work versatile and better able to weather the
test of time.

[Yue and Joachims 2008] tackle the problem of predicting diverse sets of items for information
retrieval settings. Their approach has two stages. In the first stage a user makes a
query and a set of relevant documents is returned by an oracle (e.g. a search engine). In the
second stage, their approach finds the subset of documents which achieves maximum
approximate coverage of subtopics. This is obtained by using the Greedy algorithm under
the assumption that “covering” words in the user query will cover topics (since the topics that
a document covers are unknown). In contrast, ConSeqOpt and SCP do not separate the
problem of list prediction into two stages; instead they train classifiers/regressors to directly
predict a list that is competitive with the Greedy algorithm that has perfect knowledge
of the user’s objective. This also bypasses the need for a relevance oracle like a search engine
to find a set of relevant items in the first stage.

[Yue and Guestrin 2011] tackle the problem of personal news recommendation where a
list of personalized news articles is recommended to the user. They assume a parameterized
submodular reward function whose parameters are then learnt from user interaction data in
a partial feedback setting using linear stochastic bandit algorithms (one copy per position
of the list, similar to [Streeter and Golovin 2008] and ConSeqOpt). The main limitation
of this approach is the realizability assumption, i.e., that the true user interaction model
lies within their considered linear model class. In contrast, the approaches proposed here
explicitly  consider  a  submodular  reward  function  over  lists  and  are  agnostic  to  the  reward 
model  of  the  user.  As  a  result  any  feature  space  can  be  used  to  model  the  reward  that  a 
particular  article  brings  to  a  certain  position  in  the  list  instead  of  only  the  submodular  basis 
functions  that  are  used  in  [Yue  and  Guestrin  2011]. 

As mentioned in Chapter 4.1, determinantal point processes (DPPs) are a model
originating in particle physics that can be used to optimize for diverse but low-error
predictions. Given a library of items, questions regarding the probability of subsets of
these items can be efficiently answered.
Given a subset of the items, the probability of that subset is proportional to the value of
the determinant of the sub-matrix whose rows and columns correspond to the items under
consideration. This can be geometrically interpreted in terms of the volume of the
parallelepiped spanned by the vectors representing those items. Vectors which are similar
to each other enclose less volume than vectors which are very different. This naturally
encourages sets consisting of diverse items to be picked. DPPs have been used in document
summarization, pose estimation and other tasks where predicting lists is important
[Kulesza and Taskar 2011, 2010]. But both learning and inference in DPPs remain
approximate. In contrast, we take
the route of directly optimizing for the objective at hand to predict lists.
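
The determinant-as-diversity intuition described above can be illustrated with a small
sketch. It assumes the DPP kernel is the Gram matrix of the item vectors (so the
determinant of a sub-matrix is the squared volume spanned by the chosen vectors); the item
vectors and the helper names `gram`, `det` and `subset_score` are hypothetical and chosen
only for illustration.

```python
def gram(vectors):
    """Kernel matrix L with L[i][j] = <v_i, v_j> (assumed Gram-matrix kernel)."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n, sign, result = len(m), 1.0, 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-12:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            sign = -sign
        result *= m[c][c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= f * m[c][k]
    return sign * result


def subset_score(L, subset):
    """Unnormalized probability of `subset`: determinant of the matching sub-matrix."""
    return det([[L[i][j] for j in subset] for i in subset])


# Hypothetical item vectors: items 0 and 1 are nearly parallel, item 2 is orthogonal to 0.
items = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
L = gram(items)
similar = subset_score(L, [0, 1])
diverse = subset_score(L, [0, 2])
```

Under these assumptions the subset of orthogonal items receives a much larger score than
the subset of near-duplicates, which is the behavior that encourages diverse sets.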

CHAPTER

Autonomous UAV Flight through Dense Clutter


Unmanned Aerial Vehicles (UAVs) have recently received a lot of attention from the robotics
community. While autonomous flight with active sensors like lidars has been well studied
[Scherer et al. 2008; Bachrach et al. 2009], flight using passive sensors like cameras has
lagged behind. This gap is especially important given that small UAVs do not have the
payload and power capacity to carry active sensors. Additionally, most of the modern
research on UAVs has focussed on flying at altitudes with mostly open space [Dey et al. 2011].
Flying  UAVs  close  to  the  ground  through  dense  clutter  [Ross  et  al.  2013a;  Scherer  et  al. 
2008]  has  been  less  explored.  In  this  chapter  we  leverage  the  multiple  prediction  techniques 
developed  in  earlier  chapters  and  apply  them  to  the  problem  of  pure  vision-based  autonomous 
UAV  flight  through  dense  clutter.  We  show  that  using  the  paradigm  of  multiple  predictions 
we  are  able  to  increase  average  flight  length  by  up  to  71%  over  the  single  prediction  case. 

Receding  horizon  control  [Kelly  et  al.  2006]  is  a  classical  deliberative  scheme  commonly 
used  in  autonomous  ground  vehicles  including  five  out  of  the  six  finalists  of  the  DARPA 
Urban  Challenge  [Buehler  et  al.  2008].  Figure  6.2  illustrates  receding  horizon  control  on  our 
UAV in motion capture. In receding horizon control, a pre-selected set of dynamically feasible
trajectories of fixed length (the horizon) is evaluated on a cost map of the environment
around the vehicle, and the trajectory that avoids collision while making the most progress
towards a goal location is chosen. This trajectory is traversed for a short distance and the
process is repeated.

We demonstrate the first implementation of receding horizon control with monocular vision
on a UAV. Figure 6.1 shows our quadrotor evaluating a set of trajectories on the projected
depth image obtained from monocular depth prediction and traversing the chosen one.

Figure 6.1: Example of receding horizon control with a quadrotor using monocular vision. The
lower left images show the view from the front camera and the corresponding depth images from
the monocular depth perception layer. The rest of the figure shows the overhead view of
the quadrotor and the traversability map (built by projecting out the depth image) where
red indicates higher obstacle density. The grid is 1 x 1 m². The trajectories are evaluated on
the projected depth image and the one with the lowest collision score (thick green) is
followed.


Figure 6.2: Receding horizon control on UAV in motion capture. A library of 78 trajectories
of length 5 m is evaluated to find the best collision-free trajectory. This is followed for some
time and the process repeated.

This work is motivated by our previous work [Ross et al. 2013a], where we used imitation
learning to learn a purely reactive controller for flying a UAV through dense clutter using
only monocular vision. While good obstacle avoidance behavior was obtained, a purely
reactive layer has certain limitations that a more deliberative approach like receding
horizon control can ameliorate. Reactive control is by definition myopic, i.e., it concerns itself
with avoiding the obstacles closest to the vehicle. This can lead to it being easily stuck in
cul-de-sacs. Since receding horizon control plans over longer horizons, it achieves better plans
and reduces the chances of getting stuck [Knepper and Mason 2009]. Another limitation
of pure reactive control is the difficulty of reaching a goal location or direction. In a receding
horizon control scheme, trajectories are selected based on a score which is the sum of two
terms: first, the collision score of traversing the trajectory and second, the heuristic cost of
reaching the goal from the end of the trajectory. By weighting both these terms suitably,
goal-directed behavior is realized while maintaining obstacle-avoidance capability. Note,
however, that reactive control can be integrated with receding horizon control to obtain the
best of both worlds in terms of collision avoidance behavior.
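
The two-term trajectory score described above can be sketched as follows. This is a minimal
illustration rather than the implementation used on the vehicle: the Euclidean distance used
as the heuristic cost-to-goal, the weights `w_collision` and `w_goal`, and the toy
`collision_score` function (a stand-in for a cost map lookup) are all assumptions.

```python
import math


def select_trajectory(trajectories, collision_score, goal, w_collision=1.0, w_goal=1.0):
    """Return the trajectory minimizing a weighted sum of its collision score and
    the heuristic cost-to-goal measured from the end of the trajectory."""
    def total_cost(traj):
        end_x, end_y = traj[-1]
        cost_to_goal = math.hypot(goal[0] - end_x, goal[1] - end_y)  # assumed heuristic
        return w_collision * collision_score(traj) + w_goal * cost_to_goal
    return min(trajectories, key=total_cost)


# Hypothetical scene: one obstacle at (2, 0) and a goal straight ahead at (10, 0).
obstacle = (2.0, 0.0)


def collision_score(traj):
    # Toy stand-in for a cost map: count waypoints within 0.5 m of the obstacle.
    return sum(1 for x, y in traj if math.hypot(x - obstacle[0], y - obstacle[1]) < 0.5)


straight = [(float(i), 0.0) for i in range(1, 6)]
veer_left = [(float(i), 0.3 * i) for i in range(1, 6)]

best = select_trajectory([straight, veer_left], collision_score, (10.0, 0.0), w_collision=10.0)
goal_only = select_trajectory([straight, veer_left], collision_score, (10.0, 0.0), w_collision=0.0)
```

With a large collision weight the veering trajectory is chosen; with the collision term
switched off the straight trajectory wins because it ends closer to the goal, which shows how
the weighting trades off goal-directedness against obstacle avoidance.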

Receding horizon control needs three working components:

1. A method to estimate depth: This can be obtained from stereo vision [Schmid et al.
2014; Matthies et al. 2014] or dense structure-from-motion (SfM) [Wendel et al. 2012].
But these are not amenable to higher speeds due to their high computational
expense. We note that, given enough computation power, information from
these techniques can be combined with monocular vision to improve overall perception.

Biologists  have  found  strong  evidence  that  birds  and  insects  use  optical  flow  to  navigate 
through  dense  clutter  [Srinivasan  2011].  Optical  flow  has  been  used  for  autonomous 
flight  of  UAVs  [Beyeler  et  al.  2009].  However,  it  is  difficult  to  directly  derive  a  robust 
control  principle  from  flow.  Instead  we  follow  the  same  data  driven  principle  as  our 
previous  work  [Ross  et  al.  2013a]  and  use  local  statistics  of  optical  flow  as  features  in 
the  monocular  depth  prediction  module.  This  allows  the  learning  algorithm  to  derive 
complex  behaviors  in  a  data  driven  fashion. 

2. A method for relative pose estimation: To track the trajectory chosen at every cycle,
the  pose  of  the  vehicle  must  be  tracked.  We  demonstrate  a  relative  pose  estimation 
system  using  a  downward  facing  camera  and  a  sonar,  which  is  utilized  by  the  controller 
for  tracking  the  trajectory  (Chapter  6.2.5). 

3. A method to deal with perception uncertainty: Most planning schemes either assume
that perception is perfect or make simplistic assumptions about uncertainty. We introduce
the concept of making multiple, relevant yet diverse predictions for incorporating
perception uncertainty into planning. The intuition is predicated on the observation
that avoiding a small number of ghost obstacles is acceptable as long as true obstacles
are not missed (high recall, low precision). The details are presented in Chapter 6.2.4
and are related to the methods developed in Chapters 2 and 3. We demonstrate in
experiments the efficacy of this approach as compared to making only a single best
prediction.

In summary, our contributions are:

•  Budgeted  near-optimal  feature  selection  and  fast  non-linear  regression  for  monocular 
depth  prediction. 

•  Real  time  relative  vision-based  pose  estimation. 

•  Multiple  predictions  to  efficiently  incorporate  uncertainty  in  the  planning  stage. 

•  First  complete  receding  horizon  control  implementation  on  a  UAV  with  monocular 
vision. 

6.1  Hardware  and  Software  Overview 

In this section we describe the hardware platforms and the overall software architecture used
in our experiments. Developing and testing all the integrated modules of receding horizon
control is challenging. Therefore we assembled a rover in addition to a UAV (both shown in
Figure 6.3) to be able to test various modules separately. The rover also facilitated parallel
development and testing of modules.

6.1.1  Rover 

The  skid-steered  rover  (Figure  6.3)  uses  an  Ardupilot  microcontroller  board  [Ardupilot  2015] 
which  takes  in  high  level  control  commands  from  the  planner  and  controls  four  motors  to 
achieve  the  desired  motion. 

Other  than  the  low-level  controllers,  all  other  aspects  of  the  rover  are  kept  exactly  the 
same  as  the  UAV  to  allow  seamless  transfer  of  software.  For  example,  the  rover  has  a  front 
facing  PlayStation  Eye  color  camera  (640  x  480  at  30Hz)  which  is  also  used  as  the  front  facing 
camera  on  the  UAV. 

A Bumblebee color stereo camera pair (1024 x 768 at 20Hz) is rigidly mounted with
respect to the front camera using a custom 3D printed fiber plastic encasing. This is used
for collecting data with groundtruth depth values (Chapter 6.2) and validation of planning
(Section 6.2.6). We calibrate the rigid body transform between the front camera and the
left camera of the stereo pair using Bouguet’s camera calibration toolbox [Bouguet 2004].
Stereo depth images and front camera images are recorded simultaneously while driving the
rover around using a joystick. The depth images are then transformed to the front camera’s
coordinate system to provide groundtruth depth values for every pixel. The training depth
images are from a slightly different perspective than that encountered by the UAV during
flight, but we found in practice that the depth prediction modules generalized well to the UAV.
Details are given in Chapter 6.2.

Figure 6.3: (Top) Quadrotor used as our development platform. (Bottom) Rover assembled
with the same control chips and perception software as UAV for rapid tandem development
and validation of modules.

6.1.2  UAV 

Figure  6.3  shows  the  quadrotor  we  use  for  our  experiments.  Figure  6.4  shows  the  schematic  of 
the  various  modules  that  run  onboard  and  offboard.  The  base  chassis,  motors  and  autopilot 
are  assembled  using  the  Arducopter  kit  [Ardupilot  2015].  Due  to  drift  and  noise  of  the  IMU 
integrated  in  the  Ardupilot  unit,  we  added  a  Microstrain  3DM-GX3-25  IMU  which  is  used 
to  aid  real  time  pose  estimation.  There  are  two  cameras:  one  facing  downwards  for  real  time 
pose  estimation  (PlayStation  Eye  color  camera,  320  x  240  at  120Hz)  and  one  facing  forward 
(PointGrey  Chameleon  color  camera  640  x  480  at  30Hz)  for  obstacle  avoidance.  The  onboard 
processor  is  an  Odroid  XU-3  quad-core  ARM  based  small  board  computer  [Odroid  2015] 
which  runs  Ubuntu  14.04  and  ROS  Groovy  [Quigley  et  al.  2009].  This  unit  runs  the  pose 
tracking  and  trajectory  following  modules.  LidarLite,  a  lidar  based  sensor  [LidarLite  2015]  is 
used  to  estimate  altitude.  The  image  stream  from  the  front  facing  camera  is  streamed  to  the 
base  station  where  the  depth  prediction  module  processes  it;  the  trajectory  evaluation  module 
then  finds  the  best  trajectory  to  follow  to  minimize  probability  of  collision  and  transmits  it  to 
the  onboard  computer  where  the  trajectory  following  module  runs  a  pure  pursuit  controller 
to  do  trajectory  tracking  [Coulter  1992].  The  resulting  high  level  control  commands  are  sent 
to  the  Ardupilot  which  sends  low  level  control  commands  to  the  motor  controllers  to  achieve 
the desired motion. In the following sections we describe each module in detail.
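
The pure pursuit controller mentioned above steers along the circular arc that joins the
vehicle to a look-ahead point on the path. The following is a minimal planar sketch of that
geometric rule, not the onboard implementation: the function name, the look-ahead selection
rule (first waypoint at least `lookahead` away) and the frame convention (x forward, y left)
are assumptions made for illustration.

```python
import math


def pure_pursuit_curvature(pose, path, lookahead):
    """Curvature of the arc from the vehicle to a look-ahead point on the path.

    pose: (x, y, heading) in the world frame, heading in radians.
    path: list of (x, y) waypoints in the world frame.
    Positive curvature means a left turn.
    """
    x, y, heading = pose
    # Pick the first waypoint at least `lookahead` away; fall back to the last one.
    target = path[-1]
    for px, py in path:
        if math.hypot(px - x, py - y) >= lookahead:
            target = (px, py)
            break
    dx, dy = target[0] - x, target[1] - y
    # Lateral offset of the target expressed in the vehicle frame (positive = left).
    lateral = -math.sin(heading) * dx + math.cos(heading) * dy
    dist_sq = dx * dx + dy * dy
    # Arc through the vehicle, tangent to its heading, passing through the target.
    return 0.0 if dist_sq == 0.0 else 2.0 * lateral / dist_sq


k_straight = pure_pursuit_curvature((0.0, 0.0, 0.0), [(1.0, 0.0), (2.0, 0.0)], 1.0)
k_left = pure_pursuit_curvature((0.0, 0.0, 0.0), [(1.0, 1.0)], 1.0)
k_right = pure_pursuit_curvature((0.0, 0.0, 0.0), [(1.0, -1.0)], 1.0)
```

For the target at (1, 1) the arc is a unit circle centered at (0, 1), so the returned
curvature is 1; a target straight ahead gives zero curvature.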


6.2  Monocular  Depth  Prediction 

In this section we detail the three depth prediction techniques we have developed and used in
experiments, preceded by the data collection methodology.

6.2.1  Data  Collection 

RGB-D sensors like the Kinect currently do not work outdoors. Since a camera and calibrated
nodding lidar setup is expensive and complicated, we used a rigidly mounted Bumblebee stereo
color camera and the PlayStation Eye camera for our outdoor data collection. This setup was
mounted on the rover (Figure 6.3). We collected data at different neighboring locations with
varying tree density, under varying illumination conditions and in both summer and winter
conditions (Figure 6.5). Our corpus of imagery with stereo depth information is around 16,000
images and growing. We will make this dataset publicly available in the near future.

Figure 6.4: Schematic diagram of hardware and software modules


(a) Testing and training areas near Carnegie Mellon University, Pittsburgh, USA. The images
below show a couple of examples from winter with snow on the ground. The tree density is
approximately one tree per 12 x 12 m² area.

(b) Additional training area with higher density of trees (approximately one tree per
6 x 6 m²). Images below show examples from summer.


Figure 6.5: Testing and training areas.


6.2.2  Depth  Prediction  by  Fast  Non-linear  Regression 

In this section we describe our approach to depth prediction from monocular images and the
fast non-linear regression method it uses.

An image is first divided into a grid of non-overlapping patches. We predict the depth in meters
at every patch of the image (Figure 6.6 yellow box). For each patch we extract features which
describe the patch, features which describe the full column containing the patch (Figure 6.6
green box) and features which describe the column of three times the patch width (Figure 6.6
red box), centered around the patch. The final feature vector for a patch is the concatenation
of the feature vectors of all three regions. When a patch is seen by itself it is very hard to
tell its depth relative to the rest of the scene. But by adding the features of the
area surrounding the patch, more context is available to aid the predictor [Divvala et al.
2009; Oliva and Torralba 2007].
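
The three-region feature construction described above can be sketched as follows. The
function names, the clipping of the wide column at the image border, and the toy
`mean_and_max` extractor (a placeholder for the real image features) are assumptions made
for illustration only.

```python
def context_regions(row, col, patch, img_h, img_w):
    """Boxes (top, left, bottom, right) for the patch, its column, and the 3x-wide column."""
    top, left = row * patch, col * patch
    patch_box = (top, left, top + patch, left + patch)
    column_box = (0, left, img_h, left + patch)
    # Column of three times the patch width, centered on the patch, clipped to the image.
    wide_box = (0, max(0, left - patch), img_h, min(img_w, left + 2 * patch))
    return patch_box, column_box, wide_box


def patch_feature_vector(image, row, col, patch, extract):
    """Concatenate the features extracted from the three regions of one patch."""
    img_h, img_w = len(image), len(image[0])
    features = []
    for top, left, bottom, right in context_regions(row, col, patch, img_h, img_w):
        region = [r[left:right] for r in image[top:bottom]]
        features.extend(extract(region))
    return features


def mean_and_max(region):
    # Placeholder extractor; the real system uses the image features described in the text.
    values = [v for r in region for v in r]
    return [sum(values) / len(values), max(values)]


# Synthetic 8 x 8 "image" whose pixel value is its raster index.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
vec = patch_feature_vector(image, row=1, col=1, patch=2, extract=mean_and_max)
```

Each region contributes the same number of features, so the final vector is three times the
length of a single region's descriptor.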


Figure  6.6:  The  yellow  box  is  an  example  patch,  the  green  box  is  the  column  of  the  same 
width,  and  the  red  box  is  the  column  of  3  times  the  patch  width.  Features  are  extracted 
individually  at  the  patch,  and  the  two  columns.  They  are  concatenated  together  to  form  the 
total  feature  representation  of  the  patch. 


Description  of  features 

In this part we briefly describe the features used to represent the patch. We mainly borrow
the features used in previous work on monocular imitation learning for UAVs [Ross et al.
2013a], which are partly inspired by the work of [Hoiem et al. 2005] and [Saxena
et al. 2005]. We predict the depth at every patch using these features, which
is then used by the planning module.

(a) Radon transform features help capture strong edges in images.

(b) Histogram of oriented gradients (HoG) features help capture
local gradient orientation information.

(c) We used the per pixel hand detector of Li et al. [Li and Kitani
2013] to detect trees in images and used its output as features for the
learning algorithm for predicting depth.

(d) Temporal information is maintained by computing optical flow
over the image, statistics of which are then used as features
in the learning algorithm for depth prediction.


Figure 6.7: Illustration of image features used in the learning algorithm for monocular depth
prediction.


• Optical flow: We use the Farneback dense optical flow [Farneback 2003] implementation
in OpenCV to compute for every patch the average, minimum and maximum
optical flow values. By using optical flow statistics as features, temporal information
is also available to the learning algorithm. Related works have used optical flow
information directly to derive control policies [Srinivasan 2011; Beyeler et al. 2009],
but in practice we found the estimation of flow itself to be noisy, even when a very
computationally expensive optimization based algorithm using a GPU is used [Wedel
et al. 2009]. Additionally, for the area of the image directly in front of the UAV, optical
flow is zero by definition. This makes distinguishing between obstacles and free space
directly in front of the UAV difficult. Instead we have taken the aforementioned data
driven approach where the estimation of the local scene depth is left to the learning
algorithm which is aided by optical flow information but is not completely dependent
on it. See Figure 6.7d.

• Radon Transform: The radon transform [Helgason 1980] of an image is computed by
summing up the pixel values along a discretized set of lines in the image, resulting in
a 2D matrix where the axes are the two parameters of a line in 2D: the angle θ of the line
and s, its signed distance from the origin. We discretize this matrix into 15 x 15 bins. For
each angle θ the two highest values are recorded. This encodes the orientations of strong
edges in the image. See Figure 6.7a.

• Structure Tensor: At every point in a patch the structure tensor [Harris and Stephens
1988] is computed and the angle between the two eigenvectors is used to index into a
15-bin histogram for the entire window. The corresponding eigenvalues are accumulated
in  the  bins.  In  contrast  to  the  radon  transform,  the  structure  tensor  is  a  more  local 
descriptor  of  texture.  Together  with  radon  features  the  texture  gradients  are  captured, 
which  are  strong  monocular  depth  cues  [Wu  et  al.  2004]. 

• Laws’ Masks: These describe the texture intensities [Davies 2004]. We use six masks
obtained  by  pairwise  combinations  of  one  dimensional  masks:  (L)evel,  (E)dge  and 
(S)pot.  The  image  is  converted  to  the  YCrCb  colorspace  and  the  LL  mask  is  applied 
to  all  three  channels.  The  remaining  five  masks  are  applied  to  the  Y  channel  only. 
The  results  are  computed  for  each  window  and  the  mean  absolute  value  of  each  mask 
response  is  recorded. 

For  further  details  on  radon  transform,  structure  tensor  and  Laws’  masks  usage  see 
[Ross  et  al.  2013a]. 

• Histogram of Oriented Gradients (HoG): This feature has been used widely in the computer
vision community for capturing texture information for object detection [Dalal
and Triggs 2005]. The HoG descriptor computes the histogram of local gradient orientations
over local patches in an image. In our implementation we compute the HoG
feature over 9 orientation bins. See Figure 6.7b.

• Tree feature: We use the fast per pixel classifier of Li et al. [Li and Kitani 2013] to train
a supervised tree detector. Li et al. originally used this for real time hand detection
in ego-centric videos. They use a random forest to predict whether each pixel in an
image belongs to a human hand. We adapted this fast per pixel labeling method to
predict the probability that each pixel in an image patch belongs to a tree. This
information is then used as a feature for that patch. See Figure 6.7c.
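
As an illustration of how per-patch statistics of a dense feature map are formed, the
sketch below computes the average, minimum and maximum optical flow for every patch. It
assumes a dense flow field is already available (in the real system it would come from
OpenCV's Farneback implementation) and that the statistics are taken over the flow
magnitude; the text does not say whether magnitudes or components are used, so that choice
and the function name are assumptions.

```python
import math


def flow_patch_statistics(flow, patch):
    """Per-patch (mean, min, max) of optical flow magnitude.

    flow: 2D list of (u, v) displacement tuples, one per pixel.
    Returns a dict mapping (patch_row, patch_col) -> (mean, min, max).
    """
    h, w = len(flow), len(flow[0])
    stats = {}
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            mags = [math.hypot(*flow[r][c])
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            stats[(top // patch, left // patch)] = (sum(mags) / len(mags), min(mags), max(mags))
    return stats


# Synthetic flow field: purely horizontal motion that grows with the column index.
flow = [[(float(c), 0.0) for c in range(4)] for _ in range(4)]
stats = flow_patch_statistics(flow, patch=2)
```

The three numbers per patch would then be appended to the patch's feature vector alongside
the other descriptors.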

Fast Non-linear Prediction


Due  to  harsh  real-time  constraints  an  accurate  but  fast  predictor  is  needed.  Recent  linear 
regression  implementations  are  very  fast  and  can  operate  on  millions  of  features  in  real  time 
[Langford  et  al.  2007]  but  are  limited  in  predictive  performance  by  the  inherent  linearity 
assumption.  In  very  recent  work  Agarwal  et  al.  [Agarwal  et  al.  2013]  develop  fast  iterative 
methods  which  use  linear  regression  in  the  inner  loop  to  obtain  overall  non-linear  behavior. 
This  leads  to  fast  prediction  times  while  obtaining  much  better  accuracy.  We  implemented 
Algorithm  2  in  [Agarwal  et  al.  2013]  and  found  that  it  lowered  the  error  by  10%  compared 
to  just  linear  regression,  while  still  allowing  real  time  prediction. 


Budgeted  Feature  Selection 

While  many  different  visual  features  can  be  extracted  on  images,  they  need  to  be  computed  in 
real  time.  The  faster  the  desired  speed  of  the  vehicle,  the  faster  the  perception  and  planning 
modules  have  to  work  to  maintain  safety.  Additionally  the  limited  computational  power 
onboard a small UAV imposes a budget within which to make a prediction. Each kind of
feature requires a different amount of time to extract, while contributing a different amount to
the prediction accuracy. For example, radon transforms might take relatively little time to
compute but contribute a lot to the prediction accuracy, while another feature might take
more time but contribute relatively less, or vice versa. This problem is further complicated
by  the  “grouping”  effects  where  a  particular  feature’s  performance  is  affected  by  the  presence 
or  absence  of  other  features. 

Given a time budget, the naive but obvious solution is to enumerate all possible combinations
of features within the budget and find the group of features which achieves minimum
loss. This is exponential in the number of available features. Instead we use the efficient
approach developed by Hu et al. [Hu et al. 2014] to select a near-optimal set of features
which meets the imposed budget constraints. Their approach uses a simple greedy algorithm
that first whitens feature groups and then repeatedly chooses the group with the largest gain
in explained variance divided by the time needed to achieve that gain. A more efficient
variant of this with equivalent guarantees chooses features by computing gradients to
approximate the gain in explained variance, eliminating the need to “try” all feature groups
sequentially. For each specified time budget, the features selected by this procedure are
within a constant factor of the optimal set of features which respect that budget. Since this
holds across all time budgets, this procedure provides a recursive way to generate nested
feature sets for increasing time budgets.
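
The gain-per-unit-time greedy rule described above can be sketched as follows. This is a
simplified illustration and not Hu et al.'s algorithm as published: it omits the whitening
step, it explicitly "tries" every remaining group (the non-gradient variant), and it measures
gain as the reduction in residual sum of squares of a least-squares fit. The function names,
group names, costs and data are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(columns, basis):
    """Gram-Schmidt: orthonormal directions of `columns` not already spanned by `basis`."""
    new = []
    for col in columns:
        v = list(col)
        for q in basis + new:
            proj = dot(v, q)
            v = [vi - proj * qi for vi, qi in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > 1e-9:
            new.append([vi / norm for vi in v])
    return new


def greedy_budgeted_selection(groups, costs, y, budget):
    """Repeatedly add the affordable group with the largest gain per unit cost.

    groups: dict name -> list of feature columns (each a list over samples).
    costs:  dict name -> time needed to extract that group.
    """
    basis, chosen, spent = [], [], 0.0
    residual = list(y)
    remaining = set(groups)
    while True:
        best = None
        for name in sorted(remaining):
            if spent + costs[name] > budget:
                continue
            new_dirs = orthonormalize(groups[name], basis)
            gain = sum(dot(residual, q) ** 2 for q in new_dirs)  # drop in squared error
            ratio = gain / costs[name]
            if best is None or ratio > best[0]:
                best = (ratio, name, new_dirs)
        if best is None:
            return chosen
        _, name, new_dirs = best
        for q in new_dirs:
            proj = dot(residual, q)
            residual = [ri - proj * qi for ri, qi in zip(residual, q)]
        basis += new_dirs
        chosen.append(name)
        spent += costs[name]
        remaining.remove(name)


# Hypothetical feature groups over four samples, with made-up extraction costs.
y = [1.0, 2.0, 3.0, 4.0]
groups = {"a": [[1, 0, 0, 0]], "b": [[0, 1, 0, 0], [0, 0, 1, 0]], "c": [[0, 0, 0, 1]]}
costs = {"a": 1, "b": 2, "c": 4}
chosen_small = greedy_budgeted_selection(groups, costs, y, budget=3)
chosen_large = greedy_budgeted_selection(groups, costs, y, budget=7)
```

In this toy example group "b" offers the best gain per unit time and is always picked first;
the expensive group "c" only enters once the budget is large enough to afford it.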

Figure 6.8 shows the sequence of features that was selected by Hu et al.’s [Hu et al. 2014]
feature selection procedure. For any given budget only the features on the left up to the
specified time budget need to be computed.

Figure 6.8: The upper x-axis shows the sequence of features selected by Hu et al.’s method [Hu
et al. 2014] and the lower x-axis shows the cumulative time taken for all features up to that
point. The near-optimal sequence of features rapidly decreases the prediction error. For a
given time budget, the sequence of features to the left of that time should be used.


Figure  6.9:  Depth  prediction  examples  on  real  outdoor  scenes.  Closer  obstacles  are  indicated 
by  red. 

6.2.3 Visual Odometry based Depth Prediction


Monocular visual odometry (VO) is the process of jointly estimating the camera pose and 3D
scene geometry given a sequence of camera images. Generally, feature-based methods for VO
[Davison et al. 2007; Klein and Murray 2007] consist of two separate steps: First, a set of
feature observations is obtained from the given image. Second, camera pose and scene geometry
are obtained as a function of these features only. While this abstraction greatly reduces the
complexity of the problem, it comes with several drawbacks. Feature-based methods
allow us to estimate camera pose in real time, but the resulting feature based maps provide
too sparse a representation of the scene geometry for reliable collision avoidance. Only image
information conforming to the respective feature type and parametrization (typically
image corners and blobs or line segments) is utilized. To overcome this limitation, direct
approaches [Stühmer et al. 2010; Wendel et al. 2012; Templeton 2009; Geyer et al. 2006] for
scene geometry reconstruction have become increasingly popular in the last few years.


Direct  Methods  for  VO 

Instead of operating on a set of extracted visual features, direct methods work directly on
the images for both mapping and tracking: the world is modeled
as a dense surface while new frames are tracked using whole-image alignment. This
concept removes the need for discrete features, and allows the exploitation of all information
present in the image. In addition to higher accuracy and robustness, in environments with
few interest points on which to extract features, this provides substantially more information
about the geometry of the environment. We utilize the framework of [Engel et al. 2013], a
method for semi-dense direct depth map estimation, i.e., a dense depth map covering all image
regions with non-negligible gradient.


Semi-dense  Depth  Map  Estimation 

The depth measurements are obtained by the probabilistic approach for adaptive-baseline
stereo proposed by [Engel et al. 2013]. This method explicitly takes into account the
knowledge that in video, small baseline frames occur before large baseline frames. A subset
of pixels is selected
for  which  the  disparity  is  sufficiently  large  and  for  each  selected  pixel  a  suitable  reference 
frame  is  selected.  A  one  dimensional  disparity  search  is  performed.  The  obtained  disparity  is 
converted  to  an  inverse-depth  representation,  where  the  inverse  depth  is  directly  proportional 
to  the  disparity.  The  map  is  then  updated  using  this  inverse  depth  estimate. 

Figure 6.10: Semi-dense depth map estimation. (a) Sample image and (b) inverse depth
map (red is near, blue is far) of a sample image during flight, as generated by the monocular
visual odometry based approach. Only reliable depths are propagated and the rest are
discarded (black regions), resulting in a semi-dense representation. Note: best seen in color.

The inverse depth map is propagated to subsequent frames once the pose of the following
frames has been determined, and is refined with new stereo depth measurements. Based on
the inverse depth estimate $d_0$ for the pixel, the corresponding 3D point is calculated,
projected into the new frame and assigned to the closest integer pixel position, providing
the new inverse depth estimate $d_1$. We assume the camera rotation to be small, thus the new
inverse depth map can be approximated by

$$d_1(d_0) = (d_0^{-1} - t_z)^{-1},$$


where $t_z$ is the camera translation along the optical axis. Now, for each frame, after the depth
map has been updated, a regularization step is performed by assigning each inverse depth
value the average of the surrounding inverse depths, weighted by the inverse of their respective
variances ($\sigma^2$). An example of the obtained depth estimates is shown in Figure 6.10.
Note: in order to preserve sharp edges, which can be critical in detecting trees, we only
perform this step if two adjacent depth values are statistically similar, i.e. they lie within
$2\sigma$ of each other.
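
The propagation and regularization steps above can be sketched as follows. This is a minimal illustration and not the implementation of [Engel et al. 2013]: the function names, the one-dimensional neighbourhood and the choice to drop points that end up behind the camera are assumptions made for brevity.

```python
def propagate_inverse_depth(d0, tz):
    """Propagate inverse depth d0 to the next frame, assuming small rotation.

    Implements d1 = (1/d0 - tz)^-1, where tz is the camera translation
    along the optical axis, in the same units as depth.
    """
    depth = 1.0 / d0 - tz
    if depth <= 0.0:
        return None  # assumption: a point behind the camera is discarded
    return 1.0 / depth


def regularize(inv_depths, variances, index):
    """Inverse-variance weighted average of a pixel and its two neighbours.

    A neighbour contributes only if it lies within two standard deviations
    of the centre value, so that sharp depth edges are not smoothed away.
    """
    centre = inv_depths[index]
    sigma = variances[index] ** 0.5
    num = den = 0.0
    for j in (index - 1, index, index + 1):
        if 0 <= j < len(inv_depths) and abs(inv_depths[j] - centre) <= 2.0 * sigma:
            weight = 1.0 / variances[j]
            num += weight * inv_depths[j]
            den += weight
    return num / den
```

For instance, a point at depth 2 m (inverse depth 0.5) observed after the camera advances 1 m along the optical axis is propagated to depth 1 m.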


Dense  Tracking 

We represent an image as $I : \Omega \to \mathbb{R}$, and the inverse depth map and inverse
depth variance map as $D : \Omega_D \to \mathbb{R}^+$ and $V : \Omega_D \to \mathbb{R}^+$, where
$\Omega_D$ contains all pixels which have a valid depth hypothesis. Note that $D$ and $V$ denote
the mean and variance of the inverse depth, as this approximates the uncertainty of stereo much
better than assuming a Gaussian-distributed depth [Montiel et al. 2006].

Given a semi-dense inverse depth map for the current image, the camera pose of the new
frames is estimated using direct image alignment: given the current map $\{I_M, D_M, V_M\}$, the
relative pose $\xi \in SE(3)$ of a new frame $I$ is obtained by directly minimizing the photometric
error

$$E(\xi) := \sum_{x \in \Omega_{D_M}} \left\| I_M(x) - I\big(w(x, D_M(x), \xi)\big) \right\|_\delta,$$

where $w : \Omega_{D_M} \times \mathbb{R} \times SE(3) \to \Omega$ projects a point from the
reference frame image into the new frame and $\|\cdot\|_\delta$ is the Huber norm to account for
outliers. The minimum is computed using iteratively re-weighted Levenberg-Marquardt
minimization [Engel et al. 2014].
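
To make the role of the Huber norm concrete, the sketch below evaluates the photometric error for a set of pixels whose warped intensities are already available. It uses one common form of the Huber norm; the threshold `delta`, the exact scaling and the function names are assumptions for illustration, and the warp $w$ itself is not implemented here.

```python
def huber(residual, delta):
    """One common form of the Huber norm: quadratic near zero, linear beyond delta."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return residual * residual / (2.0 * delta)
    return magnitude - delta / 2.0


def photometric_error(reference_intensities, warped_intensities, delta):
    """Sum of Huber-normed intensity differences over all pixels with valid depth.

    warped_intensities[i] is assumed to hold the new-frame intensity at the
    position to which pixel i of the reference image is projected.
    """
    return sum(
        huber(ref - new, delta)
        for ref, new in zip(reference_intensities, warped_intensities)
    )
```

Large residuals, such as those caused by occlusions, grow only linearly and therefore influence the pose estimate less than they would under a squared error.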


Scale  Estimation 

Scale ambiguity is inherent to all monocular visual odometry based methods. This is not
critical in visual mapping tasks, where the external scale can be obtained using either fiducial
markers [Daftry et al. 2015] or known dimensions of objects in the scene as a post-processing
step. However, for obstacle avoidance in real time, it is required to accurately recover the
current scale so that the distance to the object is known in real world units. We resolve the
absolute scale $\lambda \in \mathbb{R}^+$ by leveraging motion estimation from a highly accurate
single-beam lidar sensor [LidarLite 2015] onboard. We measure, at regular intervals (operating at
15 Hz), the 3-dimensional distance travelled according to the visual odometry,
$x_i \in \mathbb{R}^3$, and the metric sensors, $y_i \in \mathbb{R}^3$. Given such sample pairs
$(x_i, y_i)$, we obtain a scale $\lambda(t_i) \in \mathbb{R}$ as the running arithmetic average of
the quotients $\|x_i\| / \|y_i\|$ over a small window size. We further pass the obtained set of
scale measurements through a low-pass filter in order to avoid erroneous measurements due to
sensor noise. The true scale $\lambda$ is thus obtained and used to scale the depth map to real
world units.
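
A minimal sketch of this scale estimator is given below. The class name, the window length and the use of a first-order exponential filter as the low-pass filter are assumptions; the text does not specify the filter that was actually used.

```python
import math
from collections import deque


class ScaleEstimator:
    """Running average of ||x_i|| / ||y_i|| followed by a low-pass filter."""

    def __init__(self, window=5, alpha=0.2):
        self.ratios = deque(maxlen=window)  # most recent quotients
        self.alpha = alpha                  # smoothing factor (assumed filter)
        self.scale = None

    def update(self, x, y):
        """x: displacement from visual odometry, y: displacement from metric sensors."""
        norm_y = math.sqrt(sum(c * c for c in y))
        if norm_y == 0.0:
            return self.scale  # quotient undefined; keep the last estimate
        norm_x = math.sqrt(sum(c * c for c in x))
        self.ratios.append(norm_x / norm_y)
        window_mean = sum(self.ratios) / len(self.ratios)
        if self.scale is None:
            self.scale = window_mean
        else:
            self.scale += self.alpha * (window_mean - self.scale)
        return self.scale
```

With the quotient defined as above (odometry units per metric unit), a depth expressed in odometry units is converted to metric units by dividing it by the estimated scale.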

6.2.4  Multiple  Predictions 

The monocular depth estimates are often noisy and inaccurate due to the challenging nature
of the problem. A planning system must incorporate this uncertainty to achieve safe flight.
Figure 6.11 illustrates the difficulty of training a predictive method for building a
perception system for collision avoidance. Figure 6.11 (left) shows the ground truth location of
trees in the vicinity of an autonomous UAV. Figure 6.11 (middle) shows the location of the
trees as predicted by the perception system. In this prediction the trees on the left and far
away in front are predicted correctly, but the tree on the right is predicted close to the UAV.
This will cause the UAV to dodge a ghost obstacle. While this is bad, it is not fatal because
the UAV will not crash but only make some extraneous motions. But the prediction of trees in
Figure 6.11 (right) is potentially fatal. Here the trees far away in front and on the right are
correctly predicted, whereas the tree on the left, originally close to the UAV, is mispredicted
to be far away. This type of mistake will cause the UAV to crash into an obstacle it does not
know is there.


Figure 6.11: Illustration of the complicated nature of the loss function for collision avoidance.
(Left) Groundtruth tree locations. (Middle) Bad prediction, where a tree is predicted closer
than it actually is. (Right) Fatal prediction, where a nearby tree is mispredicted as being
further away.

Ideally, a vision-based perception system should be trained to minimize loss functions
which penalize such fatal predictions more than other kinds of predictions. But even
writing down such a loss function is difficult. Therefore most monocular depth perception
systems try to minimize easy-to-optimize surrogate loss functions like regularized $L_1$ or $L_2$
loss [Saxena et al. 2005]. We try to reduce the probability of collision by generating multiple
interpretations of the scene to hedge against the risk of committing to a single potentially
fatal interpretation, as illustrated in Figure 6.11. Specifically, we generate 3 interpretations of
the scene and evaluate the trajectories in all 3 interpretations simultaneously. The trajectory
which is least likely to collide on average in all interpretations is then chosen as the trajectory
to traverse.
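
The selection rule in the last sentence can be written down directly. In the sketch below, `costs[k][t]` is a hypothetical collision cost of trajectory `t` under interpretation `k`; how that cost is computed is outside the scope of this example.

```python
def select_trajectory(costs):
    """Return the index of the trajectory with the lowest mean collision cost.

    costs is a list with one entry per scene interpretation; each entry is a
    list holding one collision cost per candidate trajectory.
    """
    num_trajectories = len(costs[0])
    mean_cost = [
        sum(interpretation[t] for interpretation in costs) / len(costs)
        for t in range(num_trajectories)
    ]
    return min(range(num_trajectories), key=mean_cost.__getitem__)


# Trajectory 0 looks free in the first interpretation only; averaging over
# all three interpretations favours trajectory 1 instead.
example_costs = [
    [0.0, 0.2, 0.9],
    [0.9, 0.3, 0.8],
    [0.9, 0.1, 0.7],
]
```

Committing to the first interpretation alone would pick trajectory 0, which the other two interpretations consider likely to collide.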

One way of making multiple predictions is to simply sample the posterior distribution of a
learnt predictor. In order to truly capture the uncertainty of the predictor, many
interpretations have to be sampled and trajectories evaluated on each of them. A large number of
samples will come from around the peaks of this distribution, leading to wasted samples. This
is not feasible given the real-time constraints of the problem.

In previous chapters (Chapters 2 and 3), we have developed techniques for predicting
a budgeted number of interpretations of an environment with applications to manipulation,
planning and control. Batra et al. [Batra et al. 2012] have also applied similar ideas to
structured prediction problems in computer vision. These approaches try to come up with
a small number of relevant but diverse interpretations of the scene so that at least one of
them is correct. In this work, we adopt a similar philosophy and use the error profile of the
fast non-linear regressor described in Chapter 6.2 to make two additional predictions. The
non-linear regressor is first trained on a dataset of 14500 images and its performance on a
held-out dataset of 1500 images is evaluated. For each depth value predicted by it, the average
over-prediction and under-prediction error is recorded. For example, the predictor may say
that an image patch is at 3 meters while it is actually either, on average, at 4 meters or at
2.5 meters. We round each predicted depth to the nearest integer, and record the average
over- and under-predictions, as in the above example, in a look-up table (LUT). At test time
the predictor produces a depth map and the LUT is applied to this depth map, producing
two additional depth maps: one for the over-prediction error, and one for the under-prediction
error.
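
The look-up table construction can be sketched as below, working on flat lists of depths rather than images. The function names are hypothetical, and the way the two corrected maps are formed (subtracting the mean over-prediction error, adding the mean under-prediction error) is our reading of the description above rather than a confirmed detail of the implementation.

```python
from collections import defaultdict


def _mean(values):
    return sum(values) / len(values) if values else 0.0


def build_error_lut(predicted, actual):
    """Map each rounded predicted depth to (mean over-error, mean under-error)."""
    over = defaultdict(list)   # prediction was larger than the true depth
    under = defaultdict(list)  # prediction was smaller than the true depth
    for p, a in zip(predicted, actual):
        key = round(p)  # note: Python rounds exact .5 ties to the even integer
        if p > a:
            over[key].append(p - a)
        elif p < a:
            under[key].append(a - p)
    return {k: (_mean(over[k]), _mean(under[k])) for k in set(over) | set(under)}


def apply_lut(depth_map, lut):
    """Return two extra depth maps: corrected for over- and for under-prediction."""
    nearer, farther = [], []
    for d in depth_map:
        over_err, under_err = lut.get(round(d), (0.0, 0.0))
        nearer.append(d - over_err)
        farther.append(d + under_err)
    return nearer, farther
```

With the example from the text, a patch predicted at 3 meters yields additional hypotheses at 2.5 meters and 4 meters.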

Similarly  for  the  Direct  VO  based  depth  image  prediction,  we  make  multiple  predictions 
by  utilizing  the  variance  of  the  estimated  inverse  depth  which  is  already  calculated  in  the 
framework  of  [Engel  et  al.  2013].  At  every  pixel  the  variance  of  the  inverse  depth  is  used  to 
find  the  inverse  depth  value  one  standard  deviation  away  from  the  mean  (both  lower  than 
and  higher  than  the  mean  value)  and  inverted  to  obtain  a  depth  value.  So  as  before  a  total 
of  3  depth  predictions  are  made:  1)  mean  depth  estimate  2)  depth  estimate  at  one  standard 
deviation  lower  than  the  mean  depth  at  every  pixel  and  3)  depth  estimate  at  one  standard 
deviation  greater  than  the  mean  depth  at  every  pixel. 
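
For a single pixel, the three depth hypotheses can be obtained as in the following sketch. The function name is hypothetical, and returning an infinite depth when the lower inverse-depth bound is not positive is an assumption about a case the text does not discuss.

```python
def depth_hypotheses(inv_depth_mean, inv_depth_var):
    """Return (near, mean, far) depths from the inverse depth mean and variance.

    The perturbation of one standard deviation is applied in inverse-depth
    space and the result is then inverted to obtain a depth.
    """
    sigma = inv_depth_var ** 0.5
    mean_depth = 1.0 / inv_depth_mean
    near_depth = 1.0 / (inv_depth_mean + sigma)  # larger inverse depth is closer
    lower = inv_depth_mean - sigma
    far_depth = 1.0 / lower if lower > 0.0 else float("inf")  # assumed handling
    return near_depth, mean_depth, far_depth
```

Note that the resulting interval is asymmetric in depth: for an inverse depth of 0.5 with standard deviation 0.1, the mean depth is 2 m while the far hypothesis lies at 2.5 m and the near one at about 1.67 m.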

Figure  6.12  shows  an  example  in  which  making  multiple  predictions  is  clearly  beneficial 
compared  to  the  single  best  interpretation  (using  the  non-linear  regression  depth  estimation 
method).  We  provide  more  experimental  details  and  statistics  in  Chapter  6.3. 

6.2.5  Pose  Estimation 

As  discussed  before,  a  relative  pose-estimation  system  is  needed  to  follow  the  trajectories 
chosen  by  the  planning  layer.  We  use  a  downward  looking  camera  in  conjunction  with  a 
downward  facing  single  beam  lidar  [LidarLite  2015]  for  determining  relative  pose.  Looking 
forward  to  determine  pose  is  ill-conditioned  due  to  a  lack  of  parallax  as  the  camera  faces 
the  direction  of  motion.  There  are  still  significant  challenges  involved  when  looking  down. 
Texture  is  often  very  self  similar  making  it  challenging  for  traditional  feature  based  methods 
[Newcombe  et  al.  2011;  Klein  and  Murray  2007]  to  be  employed. 

Figure 6.12: The scene at top is an example from the front camera of the UAV. On the left
(panel "Single best prediction") is the predicted traversability map (red is high cost, blue is
low cost) resulting from a single interpretation of the scene. Here the UAV has selected the
straight path (thick, green), which will make it collide with the tree right in front. On the
right (panel "Multiple predictions") the traversability map is constructed from multiple
interpretations of the image, leading to the trajectory on the right being selected, which will
make the UAV avoid collision.

In receding horizon control, absolute pose with respect to some fixed world coordinate system
is not needed, as one needs to follow trajectories for short durations only. As long as
one has a relative, consistent pose estimation system for this duration (3 seconds in our
implementation), one can successfully follow trajectories.

We used a variant of a simple algorithm that has been presented quite often, most recently
in [Honegger et al. 2013]. This approach uses a Kanade-Lucas-Tomasi (KLT) tracker [Tomasi
and Kanade 1991] to detect where each pixel in a grid of pixels moves over consecutive frames,
and estimates the mean flow from these after rejecting outliers. We perform outlier detection
by comparing the variance of the flow vectors obtained for every pixel on the grid to a
specific threshold. Whenever the variance of the flow is high, we do not calculate the mean
flow velocity, and instead decay the previous velocity estimate by a constant factor.
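
The variance test and velocity decay can be sketched as follows, given the per-grid-point flow vectors already produced by a tracker. The function name, the definition of the variance as the mean squared deviation from the mean flow, and the default decay factor are assumptions for illustration.

```python
def update_velocity(flow_vectors, prev_velocity, var_threshold, decay=0.9):
    """Return the new mean-flow estimate, or a decayed previous estimate.

    flow_vectors is a list of (u, v) displacements, one per grid point.
    """
    n = len(flow_vectors)
    mean_u = sum(u for u, _ in flow_vectors) / n
    mean_v = sum(v for _, v in flow_vectors) / n
    variance = sum(
        (u - mean_u) ** 2 + (v - mean_v) ** 2 for u, v in flow_vectors
    ) / n
    if variance > var_threshold:
        # Inconsistent flow field: do not trust the mean, decay the old estimate.
        return (prev_velocity[0] * decay, prev_velocity[1] * decay)
    return (mean_u, mean_v)
```

Decaying rather than zeroing the estimate lets the vehicle coast through a few unreliable frames without an abrupt change in the estimated velocity.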

This  estimate  of  flow  however  tries  to  find  the  best  planar  displacement  between  the  two 
patches,  and  does  not  take  into  account  out-of-plane  rotations,  due  to  motion  of  the  camera. 
Camera  ego-motion  is  compensated  using  motion  information  from  the  IMU.  Finally  the 
metric  scale  is  estimated  from  sonar.  We  compute  instantaneous  relative  velocity  between 
the  camera  and  ground  which  is  integrated  over  time  to  get  position. 

This process is computationally inexpensive, and can be run at very high frame rates.
Higher frame rates lead to smaller displacements between pairs of images, which in turn
makes tracking easier.

We evaluated the performance of the flow based tracker in a motion capture environment and
compared the true motion capture tracks to the tracks returned by the flow based tracker. The
resulting tracks are shown in Figure 6.14.

Figure 6.13: The overall flow of data and control commands between various modules. The
pure pursuit trajectory follower and low-level controller (purple boxes) are shown in greater
detail at the bottom.


6.2.6  Planning  and  Control 

Figure 6.14: Comparison of the differential flow tracker performance vs ground truth in
motion capture. Yellow tracks are the true trajectories as determined by the very accurate
motion capture system; green tracks are those determined by the algorithm. Note that due to
constant replanning every 3 seconds, small drift in following a specific trajectory can be easily
tolerated. As long as the drift is not more than a few centimeters over a trajectory, collision
avoidance is not compromised.

Figure 6.13 shows the overall flow of data and control commands. The front camera video
stream is fed to the perception module, which predicts the depth of every pixel in a frame,
projects it to a point cloud representation and sends it to the receding horizon control module.
A trajectory library of 78 trajectories of length 5 meters is budgeted and picked from a much
larger library of 2401 trajectories using the maximum dispersion algorithm of Green and Kelly
[Green and Kelly 2006]. This is a greedy procedure for selecting trajectories, one at a time,
so that each subsequent trajectory spans maximum area between it and the rest of the
trajectories. The receding horizon module maintains a score for every point in the point
cloud. The score of a point decays exponentially the longer it exists. When it drops below
a user-set threshold, the point is deleted. The decay rate is specified by setting
the time constant of the decaying function. This fading memory representation of the local
scene layout has two advantages: 1) It prevents collisions caused by narrow field-of-view
issues where the quadrotor forgets that it has just avoided a tree, sees the next tree and
dodges sideways, crashing into the just avoided tree. 2) It allows emergency backtracking
maneuvers to be safely executed if required, since there is some local memory of the obstacles
it has just passed.
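
The fading memory representation can be sketched as a point store whose scores decay exponentially with age. The class and attribute names, and the initial score of 1.0, are hypothetical; only the exponential decay, the time constant and the deletion threshold come from the description above.

```python
import math


class FadingPointCloud:
    """Point cloud whose per-point scores decay exponentially with age."""

    def __init__(self, time_constant, threshold):
        self.time_constant = time_constant  # seconds; sets the decay rate
        self.threshold = threshold          # points scoring below this are deleted
        self.points = []                    # entries: (xyz, initial_score, birth_time)

    def add(self, xyz, now, score=1.0):
        self.points.append((xyz, score, now))

    def score(self, point, now):
        _, initial, born = point
        return initial * math.exp(-(now - born) / self.time_constant)

    def prune(self, now):
        """Delete every point whose decayed score fell below the threshold."""
        self.points = [p for p in self.points if self.score(p, now) >= self.threshold]
```

A larger time constant keeps obstacles in memory for longer, which helps with the narrow field-of-view problem, at the price of retaining spurious detections for longer as well.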

Our  system  accepts  a  goal  direction  as  input  and  ensures  that  the  vehicle  makes  progress 
towards  the  goal  while  avoiding  obstacles  along  the  way.  The  score  for  each  trajectory  is 
the  sum  of  three  terms:  1)  A  sphere  of  the  same  radius  as  the  quadrotor  is  convolved  along 
a  trajectory  and  the  score  of  each  point  in  collision  is  added  up.  The  higher  this  term  is 
relative  to  other  trajectories,  the  higher  the  likelihood  of  this  trajectory  being  in  collision.  2) 
A  term  which  penalizes  a  trajectory  whose  end  direction  deviates  from  goal  direction.  This 
is  weighted  by  a  user  specified  parameter.  This  term  induces  goal  directed  behavior  and  is 
tuned  to  ensure  that  the  planner  always  avoids  obstacles  as  a  first  priority.  3)  A  term  which 
penalizes  a  trajectory  for  deviating  in  translation  from  the  goal  direction. 
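
A sketch of the three-term score is given below. Several details are assumptions made to obtain a runnable example: the sphere sweep is approximated by a distance test against discrete waypoints, the translation term is taken to be the lateral offset of the end point from the ray along the goal direction (with the vehicle at the origin), and the weight `w_translation` is hypothetical, since the text only states that the heading term is weighted by a user-specified parameter.

```python
import math


def trajectory_score(traj, end_heading, cloud, goal_heading,
                     radius, w_heading, w_translation):
    """Lower is better. traj: list of (x, y, z); cloud: list of ((x, y, z), score)."""
    # 1) Sum the scores of all points within the vehicle radius of the trajectory.
    collision = sum(
        score for point, score in cloud
        if any(math.dist(point, waypoint) <= radius for waypoint in traj)
    )
    # 2) Absolute angular difference between end direction and goal direction.
    diff = end_heading - goal_heading
    heading_error = abs(math.atan2(math.sin(diff), math.cos(diff)))
    # 3) Lateral offset of the end point from the goal-direction ray (assumed form).
    end_x, end_y = traj[-1][0], traj[-1][1]
    lateral = abs(-math.sin(goal_heading) * end_x + math.cos(goal_heading) * end_y)
    return collision + w_heading * heading_error + w_translation * lateral
```

Keeping the two goal weights small relative to typical collision scores is one way to realize the stated priority of obstacle avoidance over goal-directed behavior.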

The pure pursuit controller module (Figure 6.13) takes in the coordinates of the trajectory
to follow and the current pose of the vehicle from the optical flow based pose estimation system
(Chapter 6.2.5). We use a pure pursuit strategy [Coulter 1992] to track it. Specifically,
this involves finding the closest point on the trajectory from the robot’s current estimated
position and setting the target waypoint to be a certain fixed lookahead distance further
along the trajectory. The lookahead distance can be tuned to obtain the desired smoothness
while following the trajectory; a larger lookahead distance leads to smoother motions, at the
cost of not following the trajectory exactly. Using the pose updates provided by the pose
estimation module, we head towards this moving waypoint using a generic PD controller.
Since the receding horizon control module continuously replans (at 5 Hz) based on the image
data provided by the front facing camera, we can choose to follow arbitrary lengths along a
particular trajectory before switching over to the latest chosen one.
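
The waypoint selection and the PD step can be sketched as follows for a trajectory given as a list of discrete points. The function names and gains are hypothetical, and choosing the first waypoint at or beyond the lookahead distance (rather than interpolating along a segment) is a simplification.

```python
import math


def lookahead_waypoint(trajectory, position, lookahead):
    """Return the first waypoint at least `lookahead` beyond the closest point."""
    closest = min(
        range(len(trajectory)),
        key=lambda i: math.dist(trajectory[i], position),
    )
    travelled = 0.0
    for i in range(closest, len(trajectory) - 1):
        travelled += math.dist(trajectory[i], trajectory[i + 1])
        if travelled >= lookahead:
            return trajectory[i + 1]
    return trajectory[-1]  # trajectory shorter than the lookahead distance


def pd_command(target, position, velocity, kp, kd):
    """PD control towards the target; the derivative term damps current velocity."""
    return tuple(
        kp * (t - p) - kd * v for t, p, v in zip(target, position, velocity)
    )
```

Because the waypoint moves ahead as the vehicle advances, the controller never converges onto it; it is simply pulled along the trajectory.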


Validation  of  Modules 


We  validated  each  module  separately  as  well  as  in  tandem  with  other  modules  where  each 
validation  was  progressively  integrated  with  other  modules.  This  helped  reveal  bugs  and 
instabilities  in  the  system. 

Figure 6.15: Development and testing of different modules occurred in parallel. The depth
estimation and planning modules were developed and tested on the rover while the pose
estimation and control modules required for following trajectories were developed and tested
on the UAV. Hardware-in-the-loop (HWIL) testing of all planning and control modules was
done on the UAV before perception and planning modules from the rover were integrated
and then tested on the UAV for complete autonomous flight through clutter.


• Trajectory Evaluation and Pure Pursuit Validation with Stereo Data on Rover: We
tested the trajectory evaluation and pure pursuit control module by running the entire
pipeline (other than monocular depth prediction) with stereo depth images on the rover
(see Figure 6.16).

• Trajectory Evaluation and Pure Pursuit Validation with Monocular Depth on Rover:
This  test  is  the  same  as  above  but  instead  of  using  depth  images  from  stereo  we  used 
the  monocular  depth  prediction.  This  allowed  us  to  tune  the  parameters  for  scoring 
trajectories  in  the  receding  horizon  module  to  head  towards  goal  without  colliding  with 
obstacles. 


• Trajectory Evaluation and Pure Pursuit Validation with Known Obstacles in Motion
Capture on UAV: While testing of modules progressed on the rover we assembled and
developed the pose estimation module (Chapter 6.2.5) for the UAV. We tested this
module in a motion capture lab where the position of the UAV as well as of the obstacles
was known and updated at 120 Hz (see Figure 6.2).

• Trajectory Evaluation and Pure Pursuit Validation with Hardware-in-the-Loop (HWIL):
In this test we ran the UAV in an open field, fooled the receding horizon module into
thinking it was in the midst of a point cloud and ran the whole system (except perception)
to validate the planning and control modules. Figure 6.17 shows an example from this
setup.


• Whole System: After validating each module following the evaluation protocol described
above, we ran the whole system end-to-end. Figure 6.1 shows an example scene of the
quadrotor in full autonomous mode avoiding trees outdoors. We detail the results of
collision avoidance in Section 6.3.


Figure  6.16:  Receding  horizon  control  validation  with  rover  using  depth  images  from  stereo. 
The  bright  green  trajectory  is  the  currently  selected  trajectory  to  follow.  Red  trajectories 
indicate  that  they  are  more  likely  to  be  in  collision. 

Figure 6.17: Hardware-in-the-loop testing with UAV in open field. The receding horizon
module was fooled into thinking that it was in the midst of a real world point cloud while it
planned and executed its way through it. This allowed us to validate planning and control
without endangering the UAV.

6.3  Experiments 

In  this  section  we  analyze  the  performance  of  the  deliberative  approach  using  the  three  depth 
prediction  approaches  detailed  and  discuss  pros  and  cons  of  each  method.  All  experiments 
were  conducted  in  a  wooded  area  with  dense  trees  and  a  light-weight  tether  to  the  UAV  for 
safety.  Figure  6.5  shows  the  location  of  the  test  site  (Schenley  Park,  near  Carnegie  Mellon 
University, Pittsburgh, PA). There is approximately 1 tree per $12 \times 12$ m$^2$ in this area.

We  separately  evaluate  performance  of  the  perception  module  and  the  ability  of  the  entire 
system  to  fly  in  dense  clutter. 

6.3.1  Perception  Evaluation 

Figure 6.18 shows the average depth error against depth images obtained from stereo
processing. The average error values for non-linear regression are obtained from a held-out set
of 100 images. We readily observe that direct visual odometry performs well, with low
error values up to the [15, 20] m bucket. Please note that direct visual odometry is not a
learning based method while non-linear regression is trained on a dataset of stereo depth
images. This graph nevertheless serves to show the accuracy of the visual odometry based method.

6.3.2  System  Performance  Evaluation 

Figure 6.18: Average root-mean-squared-error (RMSE) binned over groundtruth depth
buckets of [1,5] m, [5,10] m, etc. Groundtruth depth images are obtained from stereo image
processing.

Figure 6.19: Average flight distance per intervention. For each method, the corresponding
multiple prediction variant performs significantly better. We also plot the performance of
the pure reactive approach from [Ross et al. 2013a] as a baseline to highlight that being
deliberative significantly raises collision avoidance performance.

We evaluate performance by recording the average distance flown autonomously by the UAV
over several runs, before an intervention. An intervention, in this context, is defined as the
pilot overriding the autonomous system to prevent the UAV from an inevitable crash.
Experiments were performed using both the multiple predictions approach and single best
prediction, for each depth prediction method. Figure 6.19 shows the average distance traversed
per intervention and Figure 6.20 shows the average number of trees avoided by the UAV per
intervention using each perception method. We make several immediate observations:

• Multiple predictions give a significant boost to the average flight distance
for each depth prediction method over the corresponding single prediction
approach. In the case of non-linear regression based depth prediction, making multiple
predictions gives a 135% improvement, and in the case of direct visual odometry the
improvement is 71%. This validates our intuition from Chapter 6.2.4 that by avoiding
a small number of extra ghost obstacles, we can significantly reduce crashes due to
uncertainty.

• Deliberative approaches perform much better than pure reactive obstacle
avoidance. This is not surprising since by definition deliberative approaches can reason
further ahead and make better decisions than reactive approaches. Direct visual odometry
with multiple predictions can fly, on average, 3.48 times longer than the pure reactive
control used in our previous work [Ross et al. 2013a].

• Direct visual odometry based depth prediction significantly outperforms
non-linear regression based depth prediction and pure reactive control.
While this is not surprising given the accuracy of the depth maps produced by direct
visual odometry in Chapter 6.2.3, it is surprising that even when we are moving
forward, which is the direction of least parallax, good depth maps can be realized.
This can be attributed to the fact that geometric constraints provided by the ground
and trees on the periphery of the field of view over multiple frames provide enough
constraints for an accurate depth map.

• Overall, with our best performing approach (direct visual odometry with
multiple predictions), we can fly 516 meters on average before an intervention.

In Figure 6.21 failures are broken down by the type of obstacle the UAV failed to avoid,
or whether the obstacle was not in the field-of-view (FOV), for the non-linear regression based
depth prediction method (both single best and multiple prediction approaches). Overall, 39%
of the failures were due to large trees and 33% were due to hard-to-perceive obstacles like
branches and leaves. As expected, failures due to FOV issues are now the smallest contributor
to overall interventions: the reactive control strategy [Ross et al. 2013a] had 29.3% FOV
related interventions, while the deliberative approach has only 9%. This is intuitive, since
reactive control is myopic in nature and our deliberative approach helps overcome this problem
as described in previous sections. Figure 6.22 shows some typical intervention examples.

Figure 6.20: Average number of trees avoided by the UAV per intervention. This provides
an additional idea of the density of the test site.

Figure 6.21: Percentage of failure for each type. Red: Large Trees, Yellow: Thin Trees, Blue:
Foliage, Green: Narrow FOV.


6.4  Conclusion 

While we have obtained promising results, a number of challenges remain: better handling of
sudden strong wind disturbances and control schemes for leveraging the full dynamic envelope
of the vehicle. In ongoing work we are moving towards complete onboard computing of all
modules to reduce latency. We can leverage other sensing modes like sparse, but more
accurate depth estimation from stereo, which can be used as “anchor” points to improve
dense monocular depth estimation. Similarly low power, light weight lidars can be actively
[...] obstacle regions to reduce false positives and get exact depth, [...]
is to integrate the purely reactive [Ross et al. 2013a] approach [...]
detailed here, for better performance.

Figure 6.22: Examples of interventions: (Left) Bright trees saturated by sunlight from behind.
(Second from left) Thick foliage. (Third from left) Thin trees. (Right) Flare from direct
sunlight. Camera/lens with higher dynamic range and more data of rare classes should improve
performance.


CHAPTER 


Open  Issues  and  Future  Work 


In this work we have proposed methods for predicting lists which maximize monotone
submodular objectives. Recent advances in the literature [Gupta et al. 2010; Buchbinder et al. 2014;
Pan  et  al.  2014]  have  proposed  methods  for  optimizing  non-monotone  submodular  objectives 
in  the  context-free  setting.  Similar  reductions  as  demonstrated  in  this  work  can  be  harnessed 
for  contextualizing  the  optimization  of  non-monotone  submodular  objectives  as  well.  This 
will  find  ready  use  in  robotics,  computer  vision  and  general  decision-making  scenarios. 

In  all  the  methods  presented  here,  we  have  assumed  that  executing  the  list  of  actions 
does  not  change  the  current  environment.  A  concrete  example  of  this  is  to  consider  the 
manipulation  planning  case  study  2.1  where  a  list  of  initial  seed  trajectories  is  proposed  by 
ConSeqOpt  or  SCP  for  evaluation  by  the  manipulator.  As  the  manipulator  executes  this 
list  on  the  environment,  it  is  possible  that  it  will  modify  the  environment  by  unintentionally 
moving  the  objects.  This  changes  the  environment  and  affects  the  probability  of  success  of 
subsequent  seed  trajectories.  If  the  change  is  large  it  invalidates  the  performance  guarantees 
of  the  list. 

In  ongoing  work  we  are  exploring  connections  to  adaptive  submodularity  [Golovin  and 
Krause  2010]  and  imitation  learning  [Ross  et  al.  2011]  to  produce  decision  policies  which 
observe  the  result  of  the  current  action  and  then  suggest  the  next  action  to  try. 

Another limitation we would like to highlight is that our methods cannot propose novel actions
that are not present in the library. Muelling et al. [Muelling et al. 2010] generalize actions
already present in a motion library to synthesize new actions in an online manner which are
better suited to the task at hand. By incorporating such action synthesis methods into our
framework we can propose better action lists for the task at hand.

In the case of multiple structured output prediction (Section 4), it might be possible to
modify the training procedure of Structured-SVMs (specifically the decoding stage)
[Tsochantaridis et al. 2005; Ratliff et al. 2007c] to obtain the exact same performance guarantees
as ConSeqOpt in the non-structured output case. We will investigate this direction in the
near future.

APPENDIX


Proofs  of  SCP  Theoretical  Results 

This  appendix  contains  the  proofs  of  the  various  theoretical  results  presented  in  this  paper. 

A.1 Preliminaries

We  begin  by  proving  a  number  of  lemmas  about  monotone  submodular  functions,  which  will 
be  useful  to  prove  our  main  results. 

Lemma 1. Let $\mathcal{A}$ be a set and $f$ be a monotone submodular function defined on lists of items
from $\mathcal{A}$. For any lists $A$, $B$, we have that:

$$f(A \oplus B) - f(A) \le |B| \left( \mathbb{E}_{s \sim U(B)}[f(A \oplus s)] - f(A) \right)$$

for $U(B)$ the uniform distribution on items in $B$.

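Proof. For any lists $A$ and $B$, let $B_i$ denote the list of the first $i$ items in $B$, and $b_i$ the $i$th
item in $B$. We have that:

$$\begin{aligned}
f(A \oplus B) - f(A) &= \sum_{i=1}^{|B|} \left[ f(A \oplus B_i) - f(A \oplus B_{i-1}) \right] \\
&\le \sum_{i=1}^{|B|} \left[ f(A \oplus b_i) - f(A) \right] \\
&= |B| \left( \mathbb{E}_{b \sim U(B)}[f(A \oplus b)] - f(A) \right)
\end{aligned}$$

where the first equality is a telescoping sum and the inequality follows from the submodularity property of $f$. □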
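The inequality of Lemma 1 can be checked numerically on a small example. The sketch below is illustrative only: it assumes a set-coverage function as the monotone submodular $f$ (this choice, and the names `coverage`, `lemma1_gap` and `random_list`, are not taken from this work) and uses integer counts so that the comparison is exact.

```python
import random

def coverage(items):
    """Number of distinct elements covered by a list of sets (monotone submodular)."""
    covered = set()
    for s in items:
        covered |= s
    return len(covered)

def lemma1_gap(A, B):
    """Right-hand side minus left-hand side of Lemma 1; the lemma says this is >= 0."""
    f_A = coverage(A)
    lhs = coverage(A + B) - f_A
    # |B| * (E_{b ~ U(B)}[f(A + b)] - f(A)) equals the sum of single-item gains.
    rhs = sum(coverage(A + [b]) - f_A for b in B)
    return rhs - lhs

def random_list(rng, length, universe_size=12, set_size=3):
    """Hypothetical helper: a list of random subsets of {0, ..., universe_size - 1}."""
    return [frozenset(rng.sample(range(universe_size), set_size)) for _ in range(length)]

rng = random.Random(0)
gaps = [lemma1_gap(random_list(rng, 4), random_list(rng, 5)) for _ in range(200)]
print("Lemma 1 holds on all samples:", min(gaps) >= 0)
```

A check of this kind does not replace the proof; it only guards against transcription errors in the statement.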

Lemma 2. Let $\mathcal{A}$ be a set, and $f$ a monotone submodular function defined on lists of items
in $\mathcal{A}$. Let $A$, $B$ be any lists of items from $\mathcal{A}$. Denote $A_j$ the list of the first $j$ items in $A$,
$U(B)$ the uniform distribution on items in $B$ and define $\epsilon_j = \mathbb{E}_{s \sim U(B)}[f(A_{j-1} \oplus s)] - f(A_j)$,
the additive error term in competing with the average marginal benefits of the items in $B$
when picking the $j$th item in $A$ (which could be positive or negative). Then:

$$f(A) \ge \left(1 - (1 - 1/|B|)^{|A|}\right) f(B) - \sum_{i=1}^{|A|} (1 - 1/|B|)^{|A|-i} \epsilon_i$$

In particular if $|A| = |B| = k$, then:

$$f(A) \ge (1 - 1/e) f(B) - \sum_{i=1}^{k} (1 - 1/k)^{k-i} \epsilon_i$$

and for $\alpha = \exp(-|A|/|B|)$ (i.e. $|A| = |B| \log(1/\alpha)$):

$$f(A) \ge (1 - \alpha) f(B) - \sum_{i=1}^{|A|} (1 - 1/|B|)^{|A|-i} \epsilon_i$$

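Proof. Using the monotone property and Lemma 1, we must have that: $f(B) - f(A) \le f(A \oplus B) - f(A) \le |B| \left( \mathbb{E}_{b \sim U(B)}[f(A \oplus b)] - f(A) \right)$.

Now let $\Delta_j = f(B) - f(A_j)$. By the above we have that

$$\begin{aligned}
\Delta_j &\le |B| \left[ \mathbb{E}_{s \sim U(B)}[f(A_j \oplus s)] - f(A_j) \right] \\
&= |B| \left[ \mathbb{E}_{s \sim U(B)}[f(A_j \oplus s)] - f(A_{j+1}) + f(A_{j+1}) - f(B) + f(B) - f(A_j) \right] \\
&= |B| \left[ \epsilon_{j+1} + \Delta_j - \Delta_{j+1} \right]
\end{aligned}$$

Rearranging terms, this implies that $\Delta_{j+1} \le (1 - 1/|B|) \Delta_j + \epsilon_{j+1}$. Recursively expanding
this recurrence from $\Delta_{|A|}$, we obtain:

$$\Delta_{|A|} \le (1 - 1/|B|)^{|A|} \Delta_0 + \sum_{i=1}^{|A|} (1 - 1/|B|)^{|A|-i} \epsilon_i$$

where $\Delta_0 = f(B) - f(A_0) \le f(B)$ for the empty list $A_0$, since $f$ is non-negative. Using the definition of $\Delta_{|A|}$ and rearranging terms, we obtain $f(A) \ge (1 - (1 - 1/|B|)^{|A|}) f(B) - \sum_{i=1}^{|A|} (1 - 1/|B|)^{|A|-i} \epsilon_i$. This proves the first statement of the lemma. The following two
statements follow from the observation that $(1 - 1/|B|)^{|A|} = \exp(|A| \log(1 - 1/|B|)) \le \exp(-|A|/|B|) = \alpha$. Hence $(1 - (1 - 1/|B|)^{|A|}) f(B) \ge (1 - \alpha) f(B)$. When $|A| = |B|$, $\alpha = 1/e$
and this proves the special case where $|A| = |B|$. □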
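The first statement of Lemma 2 can also be verified exactly on small instances. The following is a minimal sketch under stated assumptions: $f$ is taken to be a set-coverage count with $f$ of the empty list equal to 0, and the names `coverage`, `lemma2_slack` and `random_list` are hypothetical. Exact rational arithmetic avoids floating-point noise in the comparison.

```python
from fractions import Fraction
import random

def coverage(items):
    """Number of distinct elements covered by a list of sets (monotone submodular)."""
    covered = set()
    for s in items:
        covered |= s
    return len(covered)

def lemma2_slack(A, B):
    """f(A) minus the Lemma 2 lower bound; the lemma says this is >= 0."""
    shrink = 1 - Fraction(1, len(B))          # the factor (1 - 1/|B|)
    n = len(A)
    weighted_errors = Fraction(0)
    for i in range(1, n + 1):
        prefix = A[:i - 1]                    # A_{i-1}
        avg_value = Fraction(sum(coverage(prefix + [s]) for s in B), len(B))
        eps_i = avg_value - coverage(A[:i])   # epsilon_i as defined in the lemma
        weighted_errors += shrink ** (n - i) * eps_i
    bound = (1 - shrink ** n) * coverage(B) - weighted_errors
    return coverage(A) - bound

def random_list(rng, length, universe_size=10, set_size=2):
    return [frozenset(rng.sample(range(universe_size), set_size)) for _ in range(length)]

rng = random.Random(1)
slacks = [lemma2_slack(random_list(rng, 5), random_list(rng, 3)) for _ in range(200)]
print("Lemma 2 holds on all samples:", min(slacks) >= 0)
```

Note that the list $A$ here is arbitrary (not greedy), so the $\epsilon_i$ take both signs, which is the general case covered by the lemma.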

For the greedy list construction strategy, the $\epsilon_j$ in the last lemma are always $\le 0$, such
that Lemma 2 implies that if we construct a list of size $k$ with greedy, it must achieve at least
63% of the value of the optimal list of size $k$, but also that it must achieve at least 95% of
the value of the optimal list of size $\lfloor k/3 \rfloor$, and at least 99.9% of the value of the optimal list
of size $\lfloor k/7 \rfloor$.
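The percentages quoted above are the values of $1 - \exp(-|A|/|B|)$ for size ratios 1, 3 and 7. As a hedged illustration, the sketch below computes these fractions and runs a plain greedy construction on a toy coverage instance, comparing it to a brute-force optimum. The candidate sets and the helper names are assumptions made for this example only.

```python
import itertools
import math

def coverage(items):
    """Number of distinct elements covered by a list of sets (monotone submodular)."""
    covered = set()
    for s in items:
        covered |= s
    return len(covered)

def greedy_list(candidates, k):
    """Build a list of k items, each maximizing the value of the extended list."""
    chosen = []
    for _ in range(k):
        chosen.append(max(candidates, key=lambda s: coverage(chosen + [s])))
    return chosen

def best_value(candidates, size):
    """Brute-force optimum over lists of the given size (order is irrelevant for coverage)."""
    return max(coverage(list(c)) for c in itertools.combinations(candidates, size))

def guaranteed_fraction(list_size, optimal_size):
    """Lemma 2 guarantee 1 - exp(-|A|/|B|) when every epsilon_i <= 0."""
    return 1.0 - math.exp(-list_size / optimal_size)

# Toy instance (hypothetical data).
CANDIDATES = [frozenset(s) for s in
              ({0, 1, 2}, {2, 3}, {3, 4, 5}, {5, 6}, {0, 6, 7}, {1, 4}, {7, 8}, {8, 9})]

for ratio in (1, 3, 7):
    print(ratio, round(guaranteed_fraction(ratio, 1), 4))
print(coverage(greedy_list(CANDIDATES, 3)), best_value(CANDIDATES, 3))
```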

A more surprising fact that follows from the last lemma is that constructing a list
stochastically, by sampling items from a particular fixed distribution, can provide the same
guarantee as greedy:

Lemma 3. Let $\mathcal{A}$ be a set, and $f$ a monotone submodular function defined on lists of items
in $\mathcal{A}$. Let $B$ be any list of items from $\mathcal{A}$ and $U(B)$ the uniform distribution on elements in $B$.
Suppose we construct the list $A$ by sampling $k$ items randomly from $U(B)$ (with replacement).
Denote $A_j$ the list obtained after $j$ samples, and $P_j$ the distribution over lists obtained after
$j$ samples. Then:

$$\mathbb{E}_{A \sim P_k}[f(A)] \ge \left(1 - (1 - 1/|B|)^k\right) f(B)$$

In particular, for $\alpha = \exp(-k/|B|)$:

$$\mathbb{E}_{A \sim P_k}[f(A)] \ge (1 - \alpha) f(B)$$

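Proof. The proof is similar to that of the previous lemma. Recall that by the monotone
property and Lemma 1, we have that for any list $A$: $f(B) - f(A) \le f(A \oplus B) - f(A) \le |B| \left( \mathbb{E}_{b \sim U(B)}[f(A \oplus b)] - f(A) \right)$. Because this holds for all lists, we must also have that
for any distribution $P$ over lists $A$, $f(B) - \mathbb{E}_{A \sim P}[f(A)] \le |B| \, \mathbb{E}_{A \sim P}\left[ \mathbb{E}_{b \sim U(B)}[f(A \oplus b)] - f(A) \right]$. Also note that by the way we construct lists, we have that $\mathbb{E}_{A_{j+1} \sim P_{j+1}}[f(A_{j+1})] = \mathbb{E}_{A_j \sim P_j}\left[ \mathbb{E}_{s \sim U(B)}[f(A_j \oplus s)] \right]$.

Now let $\Delta_j = f(B) - \mathbb{E}_{A_j \sim P_j}[f(A_j)]$. By the above we have that:

$$\begin{aligned}
\Delta_j &\le |B| \, \mathbb{E}_{A_j \sim P_j}\left[ \mathbb{E}_{s \sim U(B)}[f(A_j \oplus s)] - f(A_j) \right] \\
&= |B| \, \mathbb{E}_{A_j \sim P_j}\left[ \mathbb{E}_{s \sim U(B)}[f(A_j \oplus s)] - f(B) + f(B) - f(A_j) \right] \\
&= |B| \left( \mathbb{E}_{A_{j+1} \sim P_{j+1}}[f(A_{j+1})] - f(B) + f(B) - \mathbb{E}_{A_j \sim P_j}[f(A_j)] \right) \\
&= |B| \left[ \Delta_j - \Delta_{j+1} \right]
\end{aligned}$$

Rearranging terms, this implies that $\Delta_{j+1} \le (1 - 1/|B|) \Delta_j$. Recursively expanding this
recurrence from $\Delta_k$, we obtain:

$$\Delta_k \le (1 - 1/|B|)^k \Delta_0$$

Using the definition of $\Delta_k$, the fact that $\Delta_0 \le f(B)$, and rearranging terms we obtain $\mathbb{E}_{A \sim P_k}[f(A)] \ge (1 - (1 - 1/|B|)^k) f(B)$. The second statement follows again from the fact that $(1 - (1 - 1/|B|)^k) f(B) \ge (1 - \alpha) f(B)$. □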

Corollary 3. There exists a distribution that, when sampled $k$ times to construct a list,
achieves an approximation ratio of $(1 - 1/e)$ of the optimal list of size $k$ in expectation. In
particular, if $A^*$ is an optimal list of size $k$, sampling $k$ times from $U(A^*)$ achieves this
approximation ratio. Additionally, for any $\alpha \in (0, 1]$, sampling $\lceil k \log(1/\alpha) \rceil$ times must
construct a list that achieves an approximation ratio of $(1 - \alpha)$ in expectation.

Proof.  Follows  from  the  last  lemma  using  B  =  A*.  □ 

This  surprising  result  can  also  be  seen  as  a  special  case  of  a  more  general  result  proven  in 
prior  related  work  that  analyzed  randomized  set  selection  strategies  to  optimize  submodular 
functions  (Lemma  2.2  in  [Feige  et  al.  2011]). 
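The sampling guarantee of Lemma 3 can be checked exactly for small lists by enumerating every sequence of $k$ draws with replacement. The sketch below assumes a set-coverage count as $f$; the function names are hypothetical. For disjoint singleton sets the coverage function is modular and the bound is met with equality, which makes a convenient sanity check.

```python
from fractions import Fraction
from itertools import product

def coverage(items):
    """Number of distinct elements covered by a list of sets (monotone submodular)."""
    covered = set()
    for s in items:
        covered |= s
    return len(covered)

def expected_sampled_value(B, k):
    """Exact E[f(A)] when A consists of k i.i.d. uniform draws from the list B."""
    total = sum(coverage(list(sample)) for sample in product(B, repeat=k))
    return Fraction(total, len(B) ** k)

def lemma3_bound(B, k):
    """Lower bound (1 - (1 - 1/|B|)^k) f(B) from Lemma 3."""
    return (1 - (1 - Fraction(1, len(B))) ** k) * coverage(B)

overlapping = [frozenset({0, 1}), frozenset({1, 2}), frozenset({3}), frozenset({0, 3, 4})]
for k in range(1, 6):
    print(k, expected_sampled_value(overlapping, k), lemma3_bound(overlapping, k))
```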


A.2 Proofs of Main Results

We now provide the proofs of the main results in this paper. We provide the proofs for the
more general contextual case where we learn over a hypothesis class $\tilde{\Pi}$. All the results for
the context-free case can be seen as special cases of these results when $\Pi = \tilde{\Pi} = \{\psi_a \mid a \in \mathcal{A}\}$
and $\psi_a(\text{computeFeatures}(S, d)) = a$ for any example $d$ and list $S$.

We refer the reader to the notation defined in Chapters 2 and 3 for the definitions of the
various terms used.

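Theorem 1. Let $\alpha = \exp(-N/K)$ and $K' = \min(N, K)$. After $T$ iterations, for any $\delta, \delta' \in (0, 1)$,
we have that with probability at least $1 - \delta$:

$$F(\bar{\psi}, N) \ge (1 - \alpha) F(S^*_{\psi,K}) - \frac{R}{T} - 2\sqrt{\frac{2 \ln(1/\delta)}{T}}$$

and similarly, with probability at least $1 - \delta - \delta'$:

$$F(\bar{\psi}, N) \ge (1 - \alpha) F(S^*_{\psi,K}) - \frac{\mathbb{E}[R]}{T} - \sqrt{\frac{2 K' \ln(1/\delta')}{T}} - 2\sqrt{\frac{2 \ln(1/\delta)}{T}}$$

Proof.

$$\begin{aligned}
F(\bar{\psi}, N) &= \frac{1}{T} \sum_{t=1}^{T} F(\psi_t, N) \\
&= \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{S_{\psi_t,N} \sim \psi_t}\left[ \mathbb{E}_{d \sim \mathcal{D}}[f(S_{\psi_t,N}, d)] \right] \\
&= (1 - \alpha) \mathbb{E}_{d \sim \mathcal{D}}[f(S^*_{\psi,K}, d)] - \left[ (1 - \alpha) \mathbb{E}_{d \sim \mathcal{D}}[f(S^*_{\psi,K}, d)] - \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{S_{\psi_t,N} \sim \psi_t}\left[ \mathbb{E}_{d \sim \mathcal{D}}[f(S_{\psi_t,N}, d)] \right] \right]
\end{aligned}$$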

Now consider the sampled examples $\{d_t\}_{t=1}^{T}$ and the hypotheses $\psi_{t,i}$ sampled i.i.d. from $\psi_t$ to
construct the lists $\{S_t\}_{t=1}^{T}$, and denote the random variables

$$X_t = (1 - \alpha) \left( \mathbb{E}_{d \sim \mathcal{D}}[f(S^*_{\psi,K}, d)] - f(S^*_{\psi,K}, d_t) \right) - \left( \mathbb{E}_{S_{\psi_t,N} \sim \psi_t}\left[ \mathbb{E}_{d \sim \mathcal{D}}[f(S_{\psi_t,N}, d)] \right] - f(S_t, d_t) \right).$$

If $\psi_t$ is deterministic, then simply consider all $\psi_{t,i} = \psi_t$. Because the $d_t$ are i.i.d. from $\mathcal{D}$, and the distribution of hypotheses
used to construct $S_t$ only depends on $\{d_\tau\}_{\tau=1}^{t-1}$ and $\{S_\tau\}_{\tau=1}^{t-1}$, the $X_t$ conditioned on
$\{X_\tau\}_{\tau=1}^{t-1}$ have expectation 0, and because $f \in [0, 1]$ for all $d \in \mathcal{D}$, $X_t$ can vary in a range
$r \subseteq [-2, 2]$. Thus the sequence of random variables $Y_t = \sum_{i=1}^{t} X_i$, for $t = 1$ to $T$, forms
a martingale where $|Y_t - Y_{t+1}| \le 2$. By the Azuma-Hoeffding inequality, we have that
$P(Y_T/T \ge \epsilon) \le \exp(-\epsilon^2 T/8)$. Hence for any $\delta \in (0, 1)$, we have that with probability at
least $1 - \delta$, $Y_T/T \le 2\sqrt{2 \ln(1/\delta)/T}$. Hence we have that with probability at least $1 - \delta$:

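$$\begin{aligned}
F(\bar{\psi}, N) &= (1 - \alpha) \mathbb{E}_{d \sim \mathcal{D}}[f(S^*_{\psi,K}, d)] - \left[ (1 - \alpha) \mathbb{E}_{d \sim \mathcal{D}}[f(S^*_{\psi,K}, d)] - \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{S_{\psi_t,N} \sim \psi_t}\left[ \mathbb{E}_{d \sim \mathcal{D}}[f(S_{\psi_t,N}, d)] \right] \right] \\
&= (1 - \alpha) \mathbb{E}_{d \sim \mathcal{D}}[f(S^*_{\psi,K}, d)] - \left[ (1 - \alpha) \frac{1}{T} \sum_{t=1}^{T} f(S^*_{\psi,K}, d_t) - \frac{1}{T} \sum_{t=1}^{T} f(S_t, d_t) \right] - Y_T/T \\
&\ge (1 - \alpha) \mathbb{E}_{d \sim \mathcal{D}}[f(S^*_{\psi,K}, d)] - \left[ (1 - \alpha) \frac{1}{T} \sum_{t=1}^{T} f(S^*_{\psi,K}, d_t) - \frac{1}{T} \sum_{t=1}^{T} f(S_t, d_t) \right] - 2\sqrt{\frac{2 \ln(1/\delta)}{T}}
\end{aligned}$$

Let $w_i = (1 - 1/K)^{N-i}$. From Lemma 2, we have:

$$\begin{aligned}
&(1 - \alpha) \frac{1}{T} \sum_{t=1}^{T} f(S^*_{\psi,K}, d_t) - \frac{1}{T} \sum_{t=1}^{T} f(S_t, d_t) \\
&\le \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N} w_i \left( \mathbb{E}_{\psi \sim U(S^*_{\psi,K})}\left[ f(S_{t,i-1} \oplus \psi(S_{t,i-1}, d_t), d_t) \right] - f(S_{t,i}, d_t) \right) \\
&= \mathbb{E}_{\psi \sim U(S^*_{\psi,K})}\left[ \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N} w_i \left( f(S_{t,i-1} \oplus \psi(S_{t,i-1}, d_t), d_t) - f(S_{t,i}, d_t) \right) \right] \\
&\le \max_{\psi \in \tilde{\Pi}} \left[ \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N} w_i \left( f(S_{t,i-1} \oplus \psi(S_{t,i-1}, d_t), d_t) - f(S_{t,i}, d_t) \right) \right] \\
&= R/T
\end{aligned}$$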

Hence  combining  with  the  previous  result  proves  the  first  part  of  the  theorem. 

Additionally, for the sampled examples $\{d_t\}_{t=1}^{T}$ and the hypotheses $\psi_{t,i}$, consider the
random variables $Q_{N(t-1)+i} = w_i \mathbb{E}_{\psi_{t,i} \sim \psi_t}\left[ f(S_{t,i-1} \oplus \psi_{t,i}(S_{t,i-1}, d_t), d_t) \right] - w_i f(S_{t,i}, d_t)$. Because
each draw of $\psi_{t,i}$ is i.i.d. from $\psi_t$, we have that again the sequence of random variables
$Z_j = \sum_{i=1}^{j} Q_i$, for $j = 1$ to $TN$, forms a martingale, and because each $Q_i$ can take values in a
range $[-w_j, w_j]$ for $j = 1 + \mathrm{mod}(i - 1, N)$, we have $|Z_i - Z_{i-1}| \le w_j$. Since $\sum_{i=1}^{TN} |Z_i - Z_{i-1}|^2 \le T \sum_{i=1}^{N} (1 - 1/K)^{2(N-i)} \le T \min(K, N) = TK'$, by Azuma-Hoeffding's inequality, we must
have that $P(Z_{TN} \ge \epsilon) \le \exp(-\epsilon^2 / (2TK'))$. Thus for any $\delta' \in (0, 1)$, with probability at least
$1 - \delta'$, $Z_{TN} \le \sqrt{2TK' \ln(1/\delta')}$. Hence combining with the previous result, it must be the case
that with probability at least $1 - \delta - \delta'$, both $Y_T/T \le 2\sqrt{2 \ln(1/\delta)/T}$ and $Z_{TN} \le \sqrt{2TK' \ln(1/\delta')}$
hold.

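Now note that:

$$\begin{aligned}
&\max_{\psi \in \tilde{\Pi}} \left[ \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N} w_i \left( f(S_{t,i-1} \oplus \psi(S_{t,i-1}, d_t), d_t) - f(S_{t,i}, d_t) \right) \right] \\
&= \max_{\psi \in \tilde{\Pi}} \left[ \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N} w_i \left( f(S_{t,i-1} \oplus \psi(S_{t,i-1}, d_t), d_t) - \mathbb{E}_{\psi'_{t,i} \sim \psi_t}\left[ f(S_{t,i-1} \oplus \psi'_{t,i}(S_{t,i-1}, d_t), d_t) \right] \right) \right] + Z_{TN}/T \\
&= \mathbb{E}[R]/T + Z_{TN}/T
\end{aligned}$$

Using this additional fact, and combining with previous results we must have that with
probability at least $1 - \delta - \delta'$:

$$\begin{aligned}
F(\bar{\psi}, N) &\ge (1 - \alpha) F(S^*_{\psi,K}) - \left[ (1 - \alpha) \frac{1}{T} \sum_{t=1}^{T} f(S^*_{\psi,K}, d_t) - \frac{1}{T} \sum_{t=1}^{T} f(S_t, d_t) \right] - 2\sqrt{\frac{2 \ln(1/\delta)}{T}} \\
&\ge (1 - \alpha) F(S^*_{\psi,K}) - \mathbb{E}[R]/T - Z_{TN}/T - 2\sqrt{\frac{2 \ln(1/\delta)}{T}} \\
&\ge (1 - \alpha) F(S^*_{\psi,K}) - \mathbb{E}[R]/T - \sqrt{\frac{2K' \ln(1/\delta')}{T}} - 2\sqrt{\frac{2 \ln(1/\delta)}{T}}
\end{aligned}$$

□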
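To get a feel for the size of the two concentration terms in Theorem 1, the sketch below evaluates them for given $T$, $N$, $K$, $\delta$ and $\delta'$. The function name `theorem1_terms` and the example values are illustrative assumptions; the formulas are the ones derived in the proof above.

```python
import math

def theorem1_terms(T, N, K, delta, delta_prime):
    """Evaluate alpha and the two deviation terms appearing in Theorem 1."""
    alpha = math.exp(-N / K)
    k_prime = min(N, K)
    # Term from the martingale Y_T (Azuma-Hoeffding with increments bounded by 2).
    example_term = 2.0 * math.sqrt(2.0 * math.log(1.0 / delta) / T)
    # Term from the martingale Z_TN (sum of squared increments at most T * K').
    hypothesis_term = math.sqrt(2.0 * k_prime * math.log(1.0 / delta_prime) / T)
    return {"alpha": alpha, "example_term": example_term, "hypothesis_term": hypothesis_term}

for T in (100, 10_000, 1_000_000):
    print(T, theorem1_terms(T, N=5, K=5, delta=0.05, delta_prime=0.05))
```

Both deviation terms shrink as $1/\sqrt{T}$, so the guarantee approaches $(1 - \alpha) F(S^*_{\psi,K})$ minus the average regret as the number of iterations grows.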

We now show that the expected regret must grow with $\sqrt{K'}$ and not $K'$, when using
Weighted Majority with the optimal learning rate (or with the doubling trick).

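Corollary 1. Under the event where Theorem 1 holds (the event that occurs w.p. $1 - \delta - \delta'$),
if $\tilde{\Pi}$ is a finite set of hypotheses, using Weighted Majority with the optimal learning rate
guarantees that after $T$ iterations:

$$\frac{\mathbb{E}[R]}{T} \le \frac{4K' \ln|\tilde{\Pi}|}{T} + 2\sqrt{\frac{K' \ln|\tilde{\Pi}|}{T}} + 2^{5/4} (K'/T)^{3/4} (\ln(1/\delta'))^{1/4} \sqrt{\ln|\tilde{\Pi}|}$$

For large enough $T$ in $\Omega(K'(\ln|\tilde{\Pi}| + \ln(1/\delta')))$, we obtain that:

$$\frac{\mathbb{E}[R]}{T} \le O\left( \sqrt{\frac{K' \ln|\tilde{\Pi}|}{T}} \right)$$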

Proof. We use a similar argument to Lemma 4 of Streeter & Golovin [Streeter and
Golovin 2008] to bound $\mathbb{E}[R]$ in the result of Theorem 1. Consider the sum of the benefits
accumulated by the learning algorithm at position $i$ in the list, for $i \in 1, 2, \ldots, N$,
i.e. let $y_i = \sum_{t=1}^{T} b(\psi_{t,i}(S_{t,i-1}, d_t) \mid S_{t,i-1}, d_t)$, where $\psi_{t,i}$ corresponds to the particular
hypothesis sampled by Weighted Majority for choosing the item at position $i$, when constructing
the list $S_t$ for example $d_t$. Note that $\sum_{i=1}^{N} (1 - 1/K)^{N-i} y_i \le \sum_{i=1}^{N} y_i \le T$ by the
fact that the monotone submodular function $f$ is bounded in $[0, 1]$ for all examples $d$.
Now consider the sum of the benefits you could have accumulated at position $i$, had you
chosen the best fixed hypothesis in hindsight to construct the entire list, keeping the
hypothesis fixed as the list is constructed, i.e. let $z_i = \sum_{t=1}^{T} b(\psi^*(S_{t,i-1}, d_t) \mid S_{t,i-1}, d_t)$, for
$\psi^* = \arg\max_{\psi \in \tilde{\Pi}} \sum_{i=1}^{N} (1 - 1/K)^{N-i} \sum_{t=1}^{T} b(\psi(S_{t,i-1}, d_t) \mid S_{t,i-1}, d_t)$, and let $r_i = z_i - y_i$.

Now denote $Z = \sqrt{\sum_{i=1}^{N} (1 - 1/K)^{N-i} z_i}$. We have $Z^2 = \sum_{i=1}^{N} (1 - 1/K)^{N-i} z_i = \sum_{i=1}^{N} (1 - 1/K)^{N-i} (y_i + r_i) \le T + R$, where $R$ is the sample regret incurred by the learning algorithm.
Under the event where Theorem 1 holds (i.e. the event that occurs with probability at least
$1 - \delta - \delta'$), we had already shown that $R \le \mathbb{E}[R] + Z_{TN}$, for $Z_{TN} \le \sqrt{2TK' \ln(1/\delta')}$, in
the second part of the proof of Theorem 1. Thus when Theorem 1 holds, we have $Z^2 \le T + \sqrt{2TK' \ln(1/\delta')} + \mathbb{E}[R]$. Now using the generalized version of Weighted Majority with
rewards (i.e. using directly the benefits as rewards) [Arora et al. 2012], since the rewards
at each update are in $[0, K']$, we have that with the best learning rate in hindsight¹:
$\mathbb{E}[R] \le 2Z\sqrt{K' \ln|\tilde{\Pi}|}$. Thus we obtain $Z^2 \le T + \sqrt{2TK' \ln(1/\delta')} + 2Z\sqrt{K' \ln|\tilde{\Pi}|}$. This
is a quadratic inequality of the form $Z^2 - 2Z\sqrt{K' \ln|\tilde{\Pi}|} - T - \sqrt{2TK' \ln(1/\delta')} \le 0$, with the
additional constraint $Z \ge 0$. This implies $Z$ is less than or equal to the largest non-negative
root of the polynomial $Z^2 - 2Z\sqrt{K' \ln|\tilde{\Pi}|} - T - \sqrt{2TK' \ln(1/\delta')}$. Solving for the roots, we
obtain

$$\begin{aligned}
Z &\le \sqrt{K' \ln|\tilde{\Pi}|} + \sqrt{K' \ln|\tilde{\Pi}| + T + \sqrt{2TK' \ln(1/\delta')}} \\
&\le 2\sqrt{K' \ln|\tilde{\Pi}|} + \sqrt{T} + (2TK' \ln(1/\delta'))^{1/4}
\end{aligned}$$

Plugging back $Z$ into the expression $\mathbb{E}[R] \le 2Z\sqrt{K' \ln|\tilde{\Pi}|}$, we obtain:

$$\mathbb{E}[R] \le 4K' \ln|\tilde{\Pi}| + 2\sqrt{TK' \ln|\tilde{\Pi}|} + 2 (2T \ln(1/\delta'))^{1/4} (K')^{3/4} \sqrt{\ln|\tilde{\Pi}|}$$

Thus the average regret satisfies:

$$\frac{\mathbb{E}[R]}{T} \le \frac{4K' \ln|\tilde{\Pi}|}{T} + 2\sqrt{\frac{K' \ln|\tilde{\Pi}|}{T}} + 2^{5/4} (K'/T)^{3/4} (\ln(1/\delta'))^{1/4} \sqrt{\ln|\tilde{\Pi}|}$$

For $T$ in $\Omega(K'(\ln|\tilde{\Pi}| + \ln(1/\delta')))$, the dominant term is $2\sqrt{K' \ln|\tilde{\Pi}|/T}$, and thus $\mathbb{E}[R]/T$ is $O\left( \sqrt{K' \ln|\tilde{\Pi}|/T} \right)$.

□
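The algebra in the proof of Corollary 1 (largest root of the quadratic, its relaxation, and the resulting average regret bound) can be sanity-checked numerically. This is a sketch with hypothetical names (`corollary1_terms`, `num_hypotheses`); it evaluates the expressions from the proof for concrete values and is not an implementation of Weighted Majority.

```python
import math

def corollary1_terms(T, k_prime, num_hypotheses, delta_prime):
    """Evaluate the quantities used in the proof of Corollary 1."""
    log_pi = math.log(num_hypotheses)            # ln |Pi|
    log_delta = math.log(1.0 / delta_prime)      # ln(1 / delta')
    a = math.sqrt(k_prime * log_pi)
    c = T + math.sqrt(2.0 * T * k_prime * log_delta)
    # Largest root of Z^2 - 2 a Z - c = 0.
    z_root = a + math.sqrt(a * a + c)
    # Relaxation obtained with sqrt(x + y) <= sqrt(x) + sqrt(y).
    z_relaxed = 2.0 * a + math.sqrt(T) + (2.0 * T * k_prime * log_delta) ** 0.25
    # Average regret bound obtained from E[R] <= 2 Z sqrt(K' ln |Pi|).
    plugged_in = 2.0 * z_relaxed * a / T
    stated_bound = (4.0 * k_prime * log_pi / T
                    + 2.0 * math.sqrt(k_prime * log_pi / T)
                    + 2.0 ** 1.25 * (k_prime / T) ** 0.75
                    * log_delta ** 0.25 * math.sqrt(log_pi))
    return {"z_root": z_root, "z_relaxed": z_relaxed,
            "plugged_in": plugged_in, "stated_bound": stated_bound}

print(corollary1_terms(T=10_000, k_prime=5, num_hypotheses=50, delta_prime=0.05))
```

The values `plugged_in` and `stated_bound` agree up to rounding, which confirms that the three terms of the average regret bound follow from the relaxed root.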

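Corollary 2. Let $\alpha = \exp(-N/K)$ and $K' = \min(N, K)$. If we run an online learning
algorithm on the sequence of convex losses $C_t$ instead of $\ell_t$, then after $T$ iterations, for any
$\delta \in (0, 1)$, we have that with probability at least $1 - \delta$:

$$F(\bar{\psi}, N) \ge (1 - \alpha) F(S^*_{\psi,K}) - \frac{\tilde{R}}{T} - 2\sqrt{\frac{2 \ln(1/\delta)}{T}} - \mathcal{G}$$

where $\tilde{R}$ is the regret on the sequence of convex losses $C_t$, and $\mathcal{G} = \frac{1}{T}\left[ \sum_{t=1}^{T} (\ell_t(\psi_t) - C_t(\psi_t)) + \min_{\psi \in \tilde{\Pi}} \sum_{t=1}^{T} C_t(\psi) - \min_{\psi' \in \tilde{\Pi}} \sum_{t=1}^{T} \ell_t(\psi') \right]$ is the "convex optimization gap" that measures
how close the surrogate losses $C_t$ are to minimizing the cost-sensitive losses $\ell_t$.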

Proof. Follows immediately from Theorem 1 using the definitions of $R$, $\tilde{R}$ and $\mathcal{G}$, since $\mathcal{G} = (R - \tilde{R})/T$. □

¹ If not, a doubling trick can be used to get the same regret bound within a small constant factor [Cesa-Bianchi et al. 1997].

APPENDIX


Proof of Monotone Submodularity of Quality Function


Consider the quality function which scores a set of predictions for an input image by the loss
of the best prediction in the set. Using the same notation as in Chapter 4, the quality function
is reproduced as:

$$f(Y_S(I), y_{gt}) = \max_{i \in 1 \ldots N} \{ g(h_i(I), y_{gt}) \} \qquad \text{(B.1)}$$

$$\phantom{f(Y_S(I), y_{gt})} = 1 - \min_{i \in 1 \ldots N} \{ l(h_i(I), y_{gt}) \} \qquad \text{(B.2)}$$

The above equation scores the sequence of structured predictions $Y_S(I) = \{h_i(I)\}_{i \in 1 \ldots N}$
by the score of the best prediction produced by the predictors $S = \{h_1, h_2, \ldots, h_N\}$. Such a
function $f$ was proved to be monotone submodular by Dey et al. [Dey et al. 2012]. We
reproduce the proof here for convenience while adapting the exposition to the specific usage
in our case:

A set function $f$ maps subsets $A \subseteq \mathcal{A}$ of a finite set $\mathcal{A}$ to the real numbers. $f$ is
called submodular if, for all $A \subseteq B \subseteq \mathcal{A}$ and $S \in \mathcal{A} \setminus B$, it holds that

$$f(A \oplus S) - f(A) \ge f(B \oplus S) - f(B) \qquad \text{(B.3)}$$

where $\oplus$ is the concatenation operator. Such a function is monotone if for any
sets $S_1, S_2 \subseteq \mathcal{A}$, we have

$$f(S_1) \le f(S_1 \oplus S_2), \qquad f(S_2) \le f(S_1 \oplus S_2) \qquad \text{(B.4)}$$

We want to prove that $f$ (B.2) is monotone submodular. We make B.2 more general by
replacing the loss of a particular prediction $l(h_i(I))$ with $\mathrm{cost}(a_i)$, where $a_i$ is a particular
item. The simplified equation is:

$$f = 1 - \min_{a_i \in A} \{ \mathrm{cost}(a_i) \} \qquad \text{(B.5)}$$

where $A$ is the set of allowed items.

This can be proved if $\min_{a_i \in A} \{ \mathrm{cost}(a_i) \}$ is monotone supermodular. A function $f$ is
supermodular if it holds that

$$f(A \oplus S) - f(A) \le f(B \oplus S) - f(B) \qquad \text{(B.6)}$$

Theorem 2. The function $\min_{a_i \in A} \{ \mathrm{cost}(a_i) \}$ is monotone, supermodular where $a_i$ are predictions.

Proof. Supermodularity: Assume that we are given sets $A \subseteq B \subseteq \mathcal{A}$, $S \in \mathcal{A} \setminus B$. We want
to prove the inequality in B.6. Let $R = B \setminus A$, the set of elements that are in $B$ but not in
$A$. Since $A \oplus R = B$ we can now rewrite B.6 as

$$f(A \oplus S) - f(A) \le f(A \oplus R \oplus S) - f(A \oplus R) \qquad \text{(B.7)}$$

We refer to the left and right sides of B.7 as LHS and RHS respectively. Define $a^*$ as the
prediction in $A \oplus R \oplus S$ which has the least cost. Hence there can be three cases:

• Case 1: $a^* \in A$. In this case LHS = RHS = 0.

• Case 2: $a^* \in R$. In this case RHS $\ge$ LHS.

• Case 3: $a^* \in S$. In this case RHS $\ge$ LHS.

Since in all possible cases RHS is greater than or equal to LHS, it
is proved that $\min_{a_i \in A} \{ \mathrm{cost}(a_i) \}$ is supermodular. Note that if there are multiple predictions
which have the same minimum cost as $a^*$, then similar arguments still hold, and even in the
worst case when they are distributed across $S$, $R$ and $A$, Case 1 holds.

Monotonicity: Consider two sequences $S_1$ and $S_2$. Define $a^*$ as the prediction in $S_1 \oplus S_2$ which
has the least cost. We want to prove that $\min_{a_i \in A} \{ \mathrm{cost}(a_i) \}$ is monotone decreasing, i.e.

$$f(S_1) \ge f(S_1 \oplus S_2), \qquad f(S_2) \ge f(S_1 \oplus S_2) \qquad \text{(B.8)}$$

There are three possible cases:

• Case 1: $a^* \in S_1 \Rightarrow f(S_1) = f(S_1 \oplus S_2)$ and $f(S_2) \ge f(S_1 \oplus S_2)$

• Case 2: $a^* \in S_2 \Rightarrow f(S_1) \ge f(S_1 \oplus S_2)$ and $f(S_2) = f(S_1 \oplus S_2)$

• Case 3: $a^* \in S_1 \cap S_2 \Rightarrow f(S_1) = f(S_1 \oplus S_2)$ and $f(S_2) = f(S_1 \oplus S_2)$

Since in all possible cases the conditions in B.8 are satisfied, $\min_{a_i \in A} \{ \mathrm{cost}(a_i) \}$ is monotone
decreasing. □
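The case analysis above can be cross-checked by brute force on a small ground set. The following sketch enumerates all pairs $A \subseteq B$ and verifies that $f = 1 - \min \mathrm{cost}$ is monotone and submodular (equivalently, that the minimum cost is monotone decreasing and supermodular). It assumes costs in $[0, 1]$ and the convention that the empty set has quality 0; the item names and cost values are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def quality(items, cost):
    """f = 1 - min cost over the chosen items; the empty set scores 0 (assumed convention)."""
    if not items:
        return Fraction(0)
    return 1 - min(cost[a] for a in items)

def subsets(ground):
    """All subsets of the ground set as frozensets."""
    return [frozenset(c) for r in range(len(ground) + 1) for c in combinations(ground, r)]

def is_monotone_submodular(ground, cost):
    """Exhaustively check monotonicity and the diminishing-returns inequality (B.3)."""
    all_sets = subsets(ground)
    for B in all_sets:
        for A in all_sets:
            if not A <= B:
                continue
            if quality(A, cost) > quality(B, cost):
                return False                      # monotonicity violated
            for s in ground - B:
                gain_A = quality(A | {s}, cost) - quality(A, cost)
                gain_B = quality(B | {s}, cost) - quality(B, cost)
                if gain_A < gain_B:
                    return False                  # submodularity violated
    return True

GROUND = frozenset("abcde")
COST = {"a": Fraction(3, 10), "b": Fraction(7, 10), "c": Fraction(3, 10),
        "d": Fraction(1, 10), "e": Fraction(1)}
print(is_monotone_submodular(GROUND, COST))
```

Exact rationals are used so that ties between equal-cost items (the situation discussed at the end of the supermodularity argument) are compared without rounding error.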

Corollary 3. The function $f$ of Equation 4.3 in the paper is monotone, submodular due to
$\min_{a_i \in A} \{ \mathrm{cost}(a_i) \}$ being monotone, supermodular by Theorem 2.

Corollary 4. The function $F(S, \mathcal{D}) = \mathbb{E}_{(I, y_{gt}) \sim \mathcal{D}}[f(Y_S(I), y_{gt})]$ (Equation 4.4 in the paper)
is also monotone submodular, since a non-negative sum of monotone submodular functions is
also monotone submodular.